Better is_multipath_path function (issue #2298) plus some other stuff #2299
Conversation
… as disklayout.conf header plus TODO comments in layout/prepare/default/010_prepare_files.sh
With the
cf. #2298 (comment) As far as I see the initial comment header in disklayout.conf
that changes each time when "rear mkrescue" is run
I did a "rear recover" test on my laptop
BUT
With the essential help of a colleague the missing piece on /dev/sda that got destroyed by "rear recover"
How things looked in the ReaR recovery system Before:
After:
The difference is
By default neither I prefer
so I will add
…play binary files is needed in the recovery system
The fix for the
@rear/contributors
If there are no objections I merge it today afternoon.
@schabrolles In particular, please have a look at my change of the is_multipath_path function: perhaps you see something that is obviously wrong. I would like to merge it today afternoon if there are no objections.
@schabrolles I appreciate it because I know you have
# so that no "multipath -l" output could clutter the log (the "multipath -l" output is irrelevant here)
# in contrast to e.g. test "$( multipath -l )" that would falsely succeed with blank output
# and the output would appear in the log in 'set -x' debugscript mode:
multipath -l | grep -q '[[:alnum:]]' || return 1 |
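The distinction the comment draws can be demonstrated stand-alone. In this hypothetical sketch, fake_multipath (a mock, not the real tool) stands in for 'multipath -l' and emits whitespace-only output, which a plain test on the command substitution accepts but grep -q '[[:alnum:]]' correctly rejects:

```shell
# Hypothetical mock of 'multipath -l' producing blank (whitespace-only) output,
# as can happen on a non-multipath system:
fake_multipath() { printf '   \n'; }

# test -n "$(...)" keeps the spaces and wrongly treats them as real output:
test -n "$(fake_multipath)" && echo "test -n: falsely looks like multipath output"

# grep -q '[[:alnum:]]' requires at least one alphanumeric character and fails:
fake_multipath | grep -q '[[:alnum:]]' || echo "grep: no real multipath output"
```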
This reintroduces a performance problem that was fixed in PR #2034. 'multipath -l' scans all devices and the time it takes is proportional to their number, so the total time spent on all these calls is quadratic in the number of devices. For 1600 devices it used to take 300 s; now it takes 4445 s. See #2597 (comment), #2597 (comment).
First and foremost:
I am not at all a multipath expert
but nevertheless somehow it is mostly me
who has to deal with bad multipath code
and I get step by step more and more annoyed about that.
I neither see how I re-introduced the performance problem
nor do I see how the multipath -l
performance problem
was fixed by #2034
because as far as I see its changes do not contain multipath -l
but as a multipath noob I can easily miss important things here.
Furthermore we use multipath -l
at other places in our code
and as far as I know this is the only simple and reliably working
command to find out in general whether or not multipath is used at all.
So I introduced the multipath -l
test to avoid that the subsequent
wrongly working multipath -c
can mess things up on "normal" (i.e. non-multipath) systems
where multipath is not used.
This pull request was even approved by @schabrolles
who was our multipath expert who maintained it so well in ReaR.
But unfortunately he no longer gets time from his employer IBM
to further maintain ReaR, which suggests that for IBM things are OK as is.
I mean:
it is those big systems with multipath that IBM makes money with,
but when IBM no longer cares about ReaR
then IBM gets what they pay for.
Hi @jsmeix, sorry for the wrong formulation. The problem is new, but the problem fixed by #2034 was also due to quadratic time complexity in multipath code, so the result is indistinguishable (an inordinate amount of time spent creating the rescue image when there are lots of multipath devices), and for the end user (or a quality engineer) it looks like a regression.
Concerning
I am not at all a multipath expert
but nevertheless somehow it is mostly me
who has to deal with bad multipath code
and I get step by step more and more annoyed about that.
I am also not a multipath expert, but I may slowly become one, because I also have to deal with multipath issues in ReaR from time to time. So, feel free to assign multipath issues to me, or tag me for review of changes.
Concerning this particular change: do you have a test case for the problem that it fixed (#2298)? I will need to verify that any fix I am going to make does not reintroduce the problem.
Well, I'm still on vacation ;) but I had a quick look at it. If I understand well, the main issue here is the fact that multipath -l
is listing all the devices, which can take some time when a huge number of devices are connected to the server. And because this function will be executed for each device, it's even worse.
My 2 cents:
- @pcahyna, could you try on your system with a lot of devices if the following command is quicker than multipath -l:
  dmsetup ls --target multipath
- The other option would be to remove this test from "the loop" for each device.
  Maybe create a first test that creates a variable like multipath_device_count=$(dmsetup ls --target multipath | wc -l).
  This variable could be used to enter into specific code to treat multipath devices.
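The counting idea can be sketched stand-alone. Here 'dmsetup' is mocked (hypothetical output) so the logic runs anywhere; note that on a system with no multipath maps the real dmsetup ls prints "No devices found", which such a count has to filter out:

```shell
# Mocked 'dmsetup' (hypothetical): the real 'dmsetup ls --target multipath'
# prints one "name (major:minor)" line per multipath map:
dmsetup() { printf 'mpatha\t(254:0)\nmpathb\t(254:1)\n'; }

# Count the multipath maps once, filtering the "No devices found" message
# that the real dmsetup prints when there are none:
multipath_device_count=$(dmsetup ls --target multipath | grep -vc '^No devices found')

# The remembered count can then guard all multipath-specific code:
if test "$multipath_device_count" -gt 0 ; then
    echo "multipath is used ($multipath_device_count maps)"
fi
```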
And sorry again if my contribution to rear was really minimal these days. I was never paid by IBM to do it; I'm a presale and my main activity today is around Openshift on Power, so it is more difficult for me to take care of rear. It happens in my very reduced personal time, on top of my job activity, and I also need to take care of my family and myself. Thanks for your understanding.
@pcahyna
thank you for your explanatory reply.
I have no (automated) test cases.
I only test manually (I know, I know ... ;-)
In particular #2298
never happened on my virtual machines.
For me it only happened on one specific laptop.
Also #2321
was on real hardware "ThinkSystem SR630".
So an automated test that is run on virtual machines
would likely have never found that particular issue.
But I have an idea how to generically solve
the underlying generic issue:
The underlying generic issue is that
no multipath code should be run when multipath is not used.
I.e. on non-multipath systems all multipath code should be skipped.
To do that it is sufficient to run a test only once
to find out whether or not multipath is used
and remember the result (in a global variable)
and then only use that global variable.
So we could have a new function like
(just off the top of my head to show the idea - not at all tested):
function is_multipath_used {
    # 'multipath -l' is the only simple and reliably working command
    # to find out in general whether or not multipath is used at all.
    # But 'multipath -l' scans all devices and the time it takes is proportional
    # to their number so that time would become rather long (seconds up to minutes)
    # if 'multipath -l' was called for each one of hundreds or thousands of devices.
    # So we call 'multipath -l' only once and remember the result
    # in a global variable and then only use that global variable
    # so that is_multipath_used can be called as often as needed.
    is_true "$MULTIPATH_IS_USED" && return 0
    is_false "$MULTIPATH_IS_USED" && return 1
    # When MULTIPATH_IS_USED has neither a true nor a false value, set it and return accordingly.
    # Because "multipath -l" always returns a zero exit code we check if it has real output via grep -q '[[:alnum:]]'
    # so that no "multipath -l" output can clutter the log (the "multipath -l" output is irrelevant here),
    # in contrast to e.g. test "$( multipath -l )" which would falsely succeed with blank (whitespace-only) output
    # and the output would appear in the log in 'set -x' debugscript mode:
    if multipath -l | grep -q '[[:alnum:]]' ; then
        MULTIPATH_IS_USED='yes'
        return 0
    else
        MULTIPATH_IS_USED='no'
        return 1
    fi
}
plus
MULTIPATH_IS_USED=''
in default.conf together with an explanatory comment
to provide "final power to the user".
Caution!
I think during "rear recover" multipath -l
may not report the truth
unless layout/prepare/GNU/Linux/210_load_multipath.sh was run,
so during "rear recover" the function is_multipath_used must not
be called before layout/prepare/GNU/Linux/210_load_multipath.sh
was run, which does not happen at the very beginning of "rear recover".
Therefore all scripts that run before layout/prepare/GNU/Linux/210_load_multipath.sh
must not contain multipath code, given that all multipath code should
be skipped on non-multipath systems by using is_multipath_used.
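To illustrate how such a guard would behave, here is a stand-alone sketch of the proposed caching: is_true/is_false are minimal stand-ins (simplified, not ReaR's real helpers) and 'multipath' is a mock that logs each invocation to a file, so one can see that the expensive 'multipath -l' runs only on the first call while every later call only tests the remembered MULTIPATH_IS_USED value:

```shell
# Minimal stand-ins for ReaR's is_true/is_false helpers (simplified, hypothetical):
is_true()  { case "$1" in ([yY]*|[tT]*|1) return 0 ;; (*) return 1 ;; esac ; }
is_false() { case "$1" in ([nN]*|[fF]*|0) return 0 ;; (*) return 1 ;; esac ; }

# Mocked 'multipath' that pretends multipath is in use and logs each run to a file
# (a file, because the pipeline below runs it in a subshell):
calls_log=$(mktemp)
multipath() { echo run >> "$calls_log"; printf 'mpatha dm-0\n'; }

MULTIPATH_IS_USED=''

is_multipath_used() {
    is_true "$MULTIPATH_IS_USED" && return 0
    is_false "$MULTIPATH_IS_USED" && return 1
    if multipath -l | grep -q '[[:alnum:]]' ; then
        MULTIPATH_IS_USED='yes'
    else
        MULTIPATH_IS_USED='no'
        return 1
    fi
}

# Many per-device calls, but 'multipath -l' is executed only once:
for disk in /dev/sda /dev/sdb /dev/sdc ; do
    is_multipath_used && echo "$disk: handle as possible multipath setup"
done
echo "multipath -l was run $(grep -c run "$calls_log") time(s)"
```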
Hi @schabrolles, my ideas for making the test quicker and more reliable are described in #2597 (comment) (test whether lsblk --nodeps -o fstype /dev/sd... shows mpath_member) and #2597 (comment) (test whether there is a /sys/block/*/holders/* symlink pointing to the multipath device). I suppose dmsetup would work as well.
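Both per-device tests can be sketched stand-alone; lsblk is mocked and the sysfs tree is simulated in a temporary directory (both hypothetical stand-ins), since the real tests need an actual multipath setup. Note that the real lsblk call also wants --noheadings to suppress the FSTYPE header line:

```shell
dev=/dev/sda

# Test 1: lsblk reports the filesystem type of a multipath path member as "mpath_member".
# Mocked lsblk (hypothetical), standing in for: lsblk --nodeps --noheadings -o fstype "$dev"
lsblk() { printf 'mpath_member\n'; }
if [ "$(lsblk --nodeps --noheadings -o fstype "$dev")" = "mpath_member" ] ; then
    echo "$dev is a multipath path member (lsblk test)"
fi

# Test 2: a path device has a /sys/block/<dev>/holders/dm-* symlink pointing at
# its multipath map. Simulated here with a temporary stand-in for /sys:
sysfs=$(mktemp -d)
mkdir -p "$sysfs/block/sda/holders"
ln -s /dev/null "$sysfs/block/sda/holders/dm-0"
for holder in "$sysfs/block/${dev##*/}/holders"/* ; do
    [ -e "$holder" ] && echo "$dev is held by ${holder##*/} (sysfs test)"
done
```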
@schabrolles
happy to hear from you again!
No need at all to say sorry that you can no longer maintain ReaR.
It is IBM who should say sorry that they do not support ReaR.
SUSE supports ReaR.
Red Hat supports ReaR.
But IBM itself does not (directly) support it.
Same for HP.
Same for all other "big server" manufacturers.
I didn't see your posting because I was busy writing mine.
Your proposal "2." is essentially what I also suggest.
@schabrolles
of course you and your family have a much higher priority
than any piece of software could ever have
(living humans versus "dead" programs).
Of course I don't expect any further work for ReaR from you.
I very much appreciate all that you contributed.
So just enjoy your vacation - i.e. relax and recover!
I wish you and your family all the very best.
@pcahyna
if you think something like an is_multipath_used function is useful
it could of course implement any test that is appropriate;
perhaps even several tests could be needed
to make it work also on old systems where e.g.
lsblk --nodeps -o fstype /dev/sd...
may not yet work,
so e.g. multipath -l
might be used as a final fallback
(or not at all when other tests work better in general).
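Such a layered fallback might look like this stand-alone sketch (hypothetical; both external commands are mocked, with lsblk pretending to be an old version that fails, so the multipath -l fallback path is exercised):

```shell
# Mocked commands (hypothetical): an old lsblk that does not support the needed
# fstype column, and a multipath that reports multipath in use:
lsblk() { return 1; }
multipath() { printf 'mpatha dm-0\n'; }

is_multipath_path() {
    local fstype
    # Preferred cheap per-device test where lsblk supports it:
    if fstype=$(lsblk --nodeps --noheadings -o fstype "$1" 2>/dev/null) ; then
        test "$fstype" = "mpath_member"
        return
    fi
    # Final fallback on old systems (a global test, not a per-device one):
    multipath -l | grep -q '[[:alnum:]]'
}

is_multipath_path /dev/sda && echo "fallback used: multipath assumed in use"
```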
multipath -l is very slow with many multipath devices. As it will be called for every multipath device, it leads to quadratic time complexity in the number of multipath devices. For thousands of devices, ReaR can take hours to scan and exclude them. We therefore have to comment multipath -l out, as it is a huge performance regression, and find another solution to bug rear#2298.
Type: Bug Fix / Cleanup
Impact: High
In my case disklayout.conf became totally useless.
If the '/dev/sda' entries were only commented out
the bad disklayout.conf could still be manually adapted
but without any '/dev/sda' entries it is useless in practice
(I cannot guess during "rear recover" what the right values could be).
Reference to related issue (URL):
Normal /dev/sda by "multipath -c /dev/sda" falsely recognized as multipath device (DM_MULTIPATH_DEVICE_PATH="1") #2298
How was this pull request tested?
Currently untested preliminary initial state.
"rear mkrescue" looks good to me with these changes
but I need to test "rear recover" (hopefully tomorrow).
Brief description of the changes in this pull request:
- Better (i.e. hopefully more fail-safe) is_multipath_path function, cf. #2298 (comment)
- Have lsblk output as disklayout.conf header comments so that it is easier to make sense of the values in the subsequent entries.
- Some TODO comments added in layout/prepare/default/010_prepare_files.sh because I wonder about how some things therein are meant to work.
- Added xxd (belongs to vim) to the PROGS array because a tool to display binary files is also needed in the recovery system.