pid1: several fixlets for device handling #23218
cc @mwilck, @fbuihuu and @hreinecke. |
I hope most commits, except for the last two, are trivial. |
From yuwata@c7fd2d5#commitcomment-72476156
@mwilck Thank you for the comment, but I'm sorry, I cannot follow it. Could you elaborate?
Can you explain how this relates to my #23215 ? |
The code in link_update() tries to set the symlink to the device with the highest link priority. If priorities are equal, it prefers itself (the device for which it's executing). We know that this may result in races between different devices claiming the same symlink. If one device has higher link priority than all others, it will win. Regardless of which device "wins", all possible candidates will have the symlink listed in their In the case of multipath, In order to be certain to have the "right" device mapped to any symlink pseudo-device, systemd would need to read the symlinks that udev creates. The udev db itself doesn't contain this information, unless there's exactly one device with higher link prio than all others. Alternatively, one might argue that if several devices claim the same symlink with the same prio, it doesn't really matter which one wins, and except for very rare situations it wouldn't really hurt that systemd and udev are in inconsistent state (if it did hurt, we'd see a lot more errors in this area than we actually do). If we accept this argument, it'd be sufficient if systemd read the link priority and mapped symlinks to the device with the highest prio, without having to actually read the symlinks. |
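The selection rule described in the comment above can be sketched as follows. This is only a minimal Python illustration of the described behavior, not udev's actual `link_update()` code; the `Device` type and the `pick_symlink_owner` and `unique_top_claimant` helpers are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Device:
    name: str            # hypothetical device identifier
    link_priority: int   # link priority the device claims for the symlink


def pick_symlink_owner(current: Device, contenders: list[Device]) -> Device:
    """Owner as decided while handling `current`.

    The highest priority wins; on a tie the device being processed
    prefers itself, which is where the race between equal claimants
    comes from.
    """
    best = current
    for dev in contenders:
        if dev.link_priority > best.link_priority:
            best = dev
    return best


def unique_top_claimant(contenders: list[Device]) -> Optional[Device]:
    """What can be derived from priorities alone, without reading the symlink.

    Returns the single highest-priority claimant, or None when several
    devices share the top priority and the actual owner is ambiguous.
    """
    if not contenders:
        return None
    top = max(dev.link_priority for dev in contenders)
    winners = [dev for dev in contenders if dev.link_priority == top]
    return winners[0] if len(winners) == 1 else None
```

With two claimants of equal priority, `pick_symlink_owner` returns whichever device is currently being processed, so the result depends on event ordering, while `unique_top_claimant` returns `None` to signal that the priorities alone do not identify the owner.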
Ah, indeed. Thanks. Will update.
The last commit is mostly equivalent to your PR, no? |
Right ... I didn't realize this looking over the commit messages. Why didn't you just use my patch? |
I see you moved the flag modification from |
@yuwata that doesn't sound really fair, actually. As you can guess, Martin spent a lot of time investigating, analyzing and fixing the issues (introduced roughly 4 years ago by 66f3fdb, which no one could fix since then). So at minimum it would be nice to give him credit for the detailed analysis, although I'm not sure I understand why you didn't provide feedback on his PR in the first place; that's what we usually do.
Changing the author of the last commit to @mwilck is OK for me. I do not intend to steal his work. @mwilck I'm sorry if you feel uncomfortable about the commit. Which do you prefer: changing the author of the last commit to you and discussing further in this PR, or dropping the last commit from this PR and discussing the change in your PR? Both are OK for me.
In general, serialization/deserialization should be one-to-one. We should not modify values in the deserialization process, especially when it is conditionalized with the other parameter (in this case |
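The one-to-one property mentioned above can be illustrated with a small round-trip sketch. This is not systemd's serialization code; the `DeviceState` enum, the `state=` line format, and both helper functions are assumptions made up for the example, using only the state names that appear in this discussion.

```python
from enum import Enum


class DeviceState(Enum):
    DEAD = "dead"
    TENTATIVE = "tentative"
    PLUGGED = "plugged"


def serialize_state(state: DeviceState) -> str:
    """Write the state exactly as it is (hypothetical `state=` format)."""
    return f"state={state.value}"


def deserialize_state(line: str) -> DeviceState:
    """Parse the line back without adjusting the value in any way."""
    key, _, value = line.partition("=")
    if key != "state":
        raise ValueError(f"unexpected key: {key!r}")
    return DeviceState(value)
```

Under this sketch, `deserialize_state(serialize_state(s))` returns `s` for every state, so any policy that changes a state has to live outside the parser, where it is visible to the reader.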
Right.
The race should be solved by #23043. If you'd like, please take a look.
Right.
It CAN, by using
Right. I forgot that in the previous version. Now,
Right. So,
I am not sure. It may be true, but it may not be.
Interesting that you did it this way now... the basic idea looks similar to my ancient PRs #8667 and #9551, which have later been superseded by the "lockless" solution #17431. Back then, I had scaling issues with a very large number of symlink contenders (1000 and more) for a symlink. This was mostly due to bad scaling in the kernel, solved by fd7732e033e3 ("fs/locks: create a tree of dependent requests.") in kernel v5.0.
Alright. Good to know. Your patch set here is pretty big, I still have to dig deeper into it. Next week, probably. |
I see. But now it can happen that when systemd processes an event for device A and reads the devlinks, the links point to device B. Different devlinks can even point to different devices C, D, … Not saying this is a problem, we just need to be aware of the possibility. |
@yuwata I don't think so, otherwise this code would have never been able to pass my basic testing. See the call to
Well, that's the whole point of the patch: "forget" all devices after switching root that are no longer referenced by either udev or systemd (mount/swap) units. I think you should reread it more carefully. This version seems to be simpler, which is rather a good argument since the logic is already complicated. Thanks.
@fbuihuu Could you open a PR? Then we can easily review and discuss the code there.
@yuwata there seems to be a regression with your PR and |
OK I will. |
If the device is not mounted, then such a transition should occur. But does it cause any issues?
It's not clear to me whether it will cause issues in practice (maybe @mwilck knows?), but this |
Why would this even be mentioned in the udev man page? It's genuine systemd functionality. I have observed these transitions (I see them all the time, actually, because I work with multipath-tools, but I didn't bother about them), but definitely no negative impact. I have a test case here where the root FS (btrfs) is on top of LVM on top of MD RAID on top of multipath. Thus the multipath devices (which have |
@yuwata, recall this discussion we had previously? If we removed the lines
from 75d7b59, we'd still see transition messages but only "tentative→plugged", not the confusing "dead→plugged". I don't think this can be completely avoided. We apply the result from enumeration after the result from deserialization on purpose. |
Ah, and my example above shows that it can actually happen that |
This is a slightly different approach than the one taken by commit 75d7b59 to fix issues systemd#12953 and systemd#23208. This patch forces PID1 to forget all devices (except those with the "db_persist" option, see below) that were known to PID1 before switching root, by pretending that the devices were in the DEAD state before being serialized. Hence no artificial "plugged->dead" state transitions happen when PID1 is reexecuting after a switch root, followed by "dead->plugged" state transitions when all devices are coldplugged with the new set of udev rules from the host. As mentioned previously, devices with the "db_persist" option are exceptions to the mechanism described above. Since these devices remain in the udev DB even after the DB has been cleared, they continue to be deserialized in the plugged state and remain in this state, in line with the description of the option. This should fix the regression introduced by 75d7b59. Fixes: systemd#23429 Replaces: systemd#23218
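One possible reading of the mechanism in the commit message above, as a minimal Python sketch. It is not the actual patch: the `DeviceUnit` type, the `db_persist` field, the `switching_root` flag, and the `state_for_serialization` helper are hypothetical, and how PID1 actually learns which devices carry `db_persist` is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class DeviceUnit:
    name: str
    state: str         # "dead", "tentative" or "plugged"
    db_persist: bool   # assumption: True if a udev rule set OPTIONS+="db_persist"


def state_for_serialization(unit: DeviceUnit, switching_root: bool) -> str:
    """State to write out before PID1 reexecutes.

    When switching root, every device is recorded as "dead" so that it is
    forgotten, except db_persist devices, which keep their current state.
    """
    if switching_root and not unit.db_persist:
        return "dead"
    return unit.state
```

In this sketch the adjustment happens on the serializing side, so the deserializing side can read the value back unchanged, and a later coldplug is what moves the forgotten devices to "plugged" again.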
This should cover cases regarding devices with `OPTIONS+="db_persist"` during initrd->sysroot transition. See: * systemd#23429 * systemd#23218 * https://bugzilla.redhat.com/show_bug.cgi?id=2087225
This should cover cases regarding devices with `OPTIONS+="db_persist"` during initrd->sysroot transition. See: * systemd#23429 * systemd#23218 * systemd#23489 * https://bugzilla.redhat.com/show_bug.cgi?id=2087225
This should cover cases regarding devices with `OPTIONS+="db_persist"` during initrd->sysroot transition. See: * systemd/systemd#23429 * systemd/systemd#23218 * systemd/systemd#23489 * https://bugzilla.redhat.com/show_bug.cgi?id=2087225 (cherry picked from commit 1fb7f8e)
#13775 (comment) requests to backport 75d7b59. Setting the backport label. |
Sure, I will. |
@yuwata ping |
Fixes #12953.
Fixes #23208.
Replaces #23215.