New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix multiple vg recreation #2691
Conversation
Fix a problem introduced in commits b184194 and 311bfb3 (PR rear#1806) that shows up when there are multiple VGs to restore. Using variables create_thin_volumes_only and create_logical_volumes to propagate information from VG creation to LV creation does not work well in the case of multiple VGs, because the variables are global and if there are multiple VGs, their values will leak from one VG to another. The generated diskrestore.sh script does not guarantee that the LVs of a given VG are created immediately after their VG and before creating another VG. Currently, the script first creates all VGs and then all LVs, so all the LVs in all VGs will see the value of create_logical_volumes and create_thin_volumes_only from the last VG, not from their own. This matters when different VGs behave differently (typically if one has a thin pool and the other does not). Fix by replacing the scalar values by arrays of VG names. If a given VG is in the array, it is the equivalent of the former scalar value being 1 for the given VG, if it is not in the array, it is an equivalent of a former value of 0. For the create_volume_group variable the change is not needed, but do it nevertheless for symmetry with other variables.
…l/Volume Using 'vgcfgrestore' to restore Thin LVs doesn't work, because there isn't the metadata in the configuration file. To be able to use Thin LVs, upon failure of 'vgcfgrestore' (or migration), traditional 'vgcreate/lvcreate' are used instead. This approach is sub-optimal because it requires to know all the possible options to 'lvcreate', which is hard/impossible to do. Nonetheless, this is better than nothing since the issue with 'lvcreate' options was already there in case of migration. The code using 'lvcreate' only runs upon 'vgcfgrestore' failure, which typically happens only with Thin LVs (IMHO). Hence, on most setups, the optimized code will still run, preventing any regression to happen. Signed-off-by: Renaud Métrich <rmetrich@redhat.com>
|
Hello @rmetrich, you introduced the original code, could you please have a look at the change? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
| @@ -30,15 +30,6 @@ create_disk() { | |||
| [[ -b "$disk" ]] | |||
| BugIfError "Disk $disk is not a block device." | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you (while you are at it) by the way replace all *IfError function calls
by plain normal bash code, see my comment at the *IfError functions about
"Using the ...IfError functions can result unexpected behaviour in certain cases"
in usr/share/rear/lib/_input-output-functions.sh that is currently at
https://github.com/rear/rear/blob/master/usr/share/rear/lib/_input-output-functions.sh#L773
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jsmeix done. I used this opportunity to move the check for block device inside the layout script, because I like to develop the code by running the ReaR scripts outside the system to be restored, and for this I need that they do not touch the hardware (that only the generated layout script does). This way I can take a saved layout information, run a modified script on it, and examine the resulting generated layout script.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is OK to move checks into diskrestore.sh as you think it is best
but I think we must not call Error() or BugError() in diskrestore.sh
see #2691 (comment)
I think it works inside diskrestore.sh to call e.g. LogPrintError()
to show an error message to the user and then exit the
diskrestore.sh script normally with non zero exit code
or by calling false and let it exit via set -e.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call e.g. LogPrintError()
to show an error message to the user and then exit the
diskrestore.sh script normally with non zero exit code
I think it is worth to have a helper function for this. Now, the question is whether to introduce new function(s), called maybe SubshellError() and SubshellBugError() and call them from the layout script, or whether make Error() and BugError() smarter by teaching them to DTRT when called from inside the layout script (this could be achieved by setting a dedicated variable at the start of the layout script, in 540_generate_device_code.sh, and then examining it in Error() and BugError()). The advantage of the latter approach would be that in the former approach, you need to classify functions into two categories: those called from the layout (or similar) script and those called from the regular ReaR code, use SubshellError()/SubshellBugError() in the former and Error()/BugError() in the latter and never mix them together.
(The same problem arises even without introducing helpers like SubshellError()/SubshellBugError(). Even if performing what you propose manually (call e.g. LogPrintError() to show an error message to the user and then exit the diskrestore.sh script normally with non zero exit code) you still need to classify functions according to whether one should do this or one is allowed to use Error()/BugError() normally.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thinking when I found out about this problem here
was to make Error() smarter.
I think only Error() need to be enhanced
but no other functions like BugError() that call Error()
because all what matter is the code in Error() that exits
the whole running rear program with all its child processes.
I am even thinking about no longer exit the whole ReaR
during "rear recover" by default when Error() is called
because this problem here (Error exit during diskrestore.sh)
shows that likely it is basically always wrong to
exit the whole ReaR during "rear recover" because
that results basically always bad UX for the user
because I think exit the whole ReaR during "rear recover"
is never helpful for the user - what should he do in that case?
In contrast clear ERROR messages during "rear recover"
with the option to proceed (cf. "Final power to the user!")
may help the user to shomehow complete "rear recover"
and afterwards fix the issues that lead to the error messages
or deliberately ignore them, reboot, and hope for the best...
I think the "How to deal with errors during 'rear recover'?" area
should be properly investigated via a new separated issue
#2695
@pcahyna
I leave it to you if you prefer to keep the current BugError
in this pull request here or to change it into a LogPrintError.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am even thinking about no longer exit the whole ReaR
during "rear recover" by default when Error() is called
because this problem here (Error exit during diskrestore.sh)
shows that likely it is basically always wrong to
exit the whole ReaR during "rear recover" because
that results basically always bad UX for the user
Your reasoning may apply to Error() but not to BugError(). There is not much that can be done when a code bug is detected - all bets are off in this situation and nasal daemons may fly, as you can never be certain what the bug is and how to handle it. After all, calling assert() or a similar function that aborts in places that are considered unreachable is a common practice, and a good one, as it permits detecting problems early instead of continuing with bad data. So I think that some variant of BugError() that aborts ReaR will always have its uses.
Of course, this does not mean that in the layout script BugError() needs to abort the whole ReaR. It is reasonable to abort only the script and let the user fix it (the script is meant to be adjustable by the user, after all).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I leave it to you if you prefer to keep the current BugError
in this pull request here or to change it into a LogPrintError.
I prefer to keep it as it is and enhance Error() (or BugError()) separately to behave properly in scripts executed in subshells.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not about subshells.
Error() should (hopefully) behave properly in scripts executed in subshells,
see my comments in its code - I treid to make it behave properly.
It is about whether or not enforced abort of whole ReaR during "rear recover"
is helpful for the user compared to provide final power to the user
to let him decide as he wants to deal with an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not about subshells.
By "behave properly in scripts executed in subshells" I meant, behave properly in the case of scripts like diskrestore.sh (are there any other scripts like this, that are dynamically generated and users can review and edit them before executing?) whose failure should not abort the whole ReaR process. I did not mean an arbitrary script sourced in an arbitrary subshell, sorry for the confusion.
After this change, nothing in 100_include_partition_code.sh requires the actual hardware anymore - only the saved layout. This enables running the script outside of the rescue environment for testing, if one has the saved layout data, and thus facilitates testing and experimenting a lot. Functional change: if the disk is not found, it is now the layout script that aborts with BugError, not whole ReaR.
| cat >> "$LAYOUT_CODE" <<EOF | ||
|
|
||
| # | ||
| # Code handling disk '$disk' | ||
| # | ||
|
|
||
| ### Disks should be block devices. | ||
| [ -b "$disk" ] || BugError "Disk $disk is not a block device." | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think BugError should not be called here
because the code is
create_disk() {
local component disk size label junk
read component disk size label junk < <(grep "^disk $1 " "$LAYOUT_FILE")
cat >> "$LAYOUT_CODE" <<EOF
#
# Code handling disk '$disk'
#
### Disks should be block devices.
[ -b "$disk" ] || BugError "Disk $disk is not a block device."
so the BugError call happens inside the diskrestore.sh script
but usually BugError exits the whole running rear program
with all its child processes in particular via the function
terminate_descendants_from_children_to_grandchildren()
while any error in the diskrestore.sh script that runs with set -e
should only exit the diskrestore.sh script and then there is
a user dialog in the still running rear program
via layout/recreate/default/200_run_layout_code.sh
about the error exit of the diskrestore.sh script
so the user could even fix the diskrestore.sh script
and let the fixed one run again.
There may be other possibly wrong calls of BugError() or Error()
within code that gets run inside the diskrestore.sh script.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jsmeix I thought about that and decided that it is OK, because in several other places the layout script can call BugError (indirectly through create_disk_label and create_disk_partition from layout-functions.sh). Of course, if those other calls are wrong, this changes things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pcahyna
I had never thought before if it is OK or wrong to call Error()
(also indirectly via BugError) within the diskrestore.sh script
so I expected that this already happens somewhere directly
or indirectly in our code.
I did not check what actually happens in case of Error()
inside diskrestore.sh.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way:
In general it is bad to error out the whole program from inside a function.
Normally a function should return a failure value to its caller
and let the caller decide what to do in this case.
Of course there are exceptions for example the function
assert_ISO_FILE_SIZE_LIMIT in lib/output-functions.sh currently at
https://github.com/rear/rear/blob/master/usr/share/rear/lib/output-functions.sh#L28
| # create_logical_volumes - LVs in the VG need to be created using the lvcreate command | ||
| # create_thin_volumes_only - when the previous condition is true, | ||
| # do not create volumes that are not thin volumes. | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for that great explanation.
Now everyone understands why arrays are needed
so noone would have at a later time "a great idea"
to simplify that overcomplicated looking code ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I approve it, see
#2691 (comment)
|
@pcahyna |
In layout/prepare/GNU/Linux/110_include_lvm_code.sh fixed typo in comment 'grup' -> 'group' that is from #2691
| @@ -67,10 +66,35 @@ create_lvmgrp() { | |||
|
|
|||
| local vg=${vgrp#/dev/} | |||
|
|
|||
| # If a volume group name is in one of the following lists, it | |||
| # means that the particular condition is valid for this volume | |||
| # grup. If it is not in the list, it means that the condition is | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed the typo in comment 'grup' -> 'group' via
02dad20
Pull Request Details:
Type: Bug Fix
Impact: Normal
Reference to related issue (URL): rmetrich@b184194#r56498696 rmetrich@311bfb3#r56498755
How was this pull request tested?
Backing up and recreating a system with two VGs, one contains thin pools, the other does not.
Brief description of the changes in this pull request:
Fix a problem introduced in commits b184194 and 311bfb3 (PR Issue #1380 - ReaR recovery fails when the OS contains a Thin Pool/Volume #1806) that shows up when there are multiple VGs to restore.
Using variables
create_thin_volumes_onlyandcreate_logical_volumesto propagate infromation from VG creation to LV creation does not work well in the case of multiple VGs, because the variables are global and if there are multiple VGs, their values will leak from one VG to another. The generated diskrestore.sh script does not guarantee that the LVs of a given VG are created immediately after their VG and before creating another VG. Currently, the script first creates all VGs and then all LVs, so all the LVs in all VGs will see the value ofcreate_logical_volumesandcreate_thin_volumes_onlyfrom the last VG, not from their own. This matters when different VGs behave differently (typically if one has a thin pool and the other does not).Fix by replacing the scalar values by arrays of VG names. If a given VG is in the array, it is the equivalent of the former scalar value being 1 for the given VG, if it is not in the array, it is an equivalent of a former value of 0.
For the
create_volume_groupvariable the change is not needed, but do it nevertheless for symmetry with other variables.