Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pid1: several fixlets for device handling #23218

Merged
merged 2 commits into from May 12, 2022
Merged

Conversation

yuwata
Copy link
Member

@yuwata yuwata commented Apr 28, 2022

Fixes #12953.
Fixes #23208.
Replaces #23215.

@yuwata yuwata added the pid1 label Apr 28, 2022
@yuwata yuwata force-pushed the core-device branch 2 times, most recently from be551fa to b93f9fd Compare April 29, 2022 11:33
@yuwata
Copy link
Member Author

yuwata commented Apr 29, 2022

cc @mwilck, @fbuihuu and @hreinecke.

@yuwata
Copy link
Member Author

yuwata commented Apr 29, 2022

I hope most commits, except for the last two, are trivial.

@yuwata yuwata changed the title [WIP] pid1: several fixlets for device handling pid1: several fixlets for device handling Apr 29, 2022
@yuwata yuwata marked this pull request as ready for review April 29, 2022 11:45
@yuwata
Copy link
Member Author

yuwata commented Apr 29, 2022

From yuwata@c7fd2d5#commitcomment-72476156

Thanks for looking into this! Note that my comment in systemd#23208 was a bit misleading perhaps. In the case I quoted, udev had indeed taken over the symlink. But that's not necessarily the case if you look at the udev code. A worker may or may not take over the symlink. In particular, systemd has no concept of udev's link priority.

IMO the only safe way for systemd to derive the current state of symlinks is to read the symlink from the filesystem, at least after each ADD/CHANGE/REMOVE uevent.

@mwilck Thank you for the comment. But, sorry, I cannot follow the comment. Could you elaborate more?

@mwilck
Copy link
Contributor

mwilck commented Apr 29, 2022

Can you explain how this relates to my #23215 ?

@mwilck
Copy link
Contributor

mwilck commented Apr 29, 2022

@mwilck Thank you for the comment. But, sorry, I cannot follow the comment. Could you elaborate more?

The code in link_update() tries to set the symlink to the device with the highest link priority. If priorities are equal, it prefers itself (the device for which it's executing). We know that this may result in races between different devices claiming the same symlink. If one device has higher link priority than all others, it will win. Regardless of which device "wins", all possible candidates will have the symlink listed in their DEVLINKS property. Thus if systemd processes an uevent with a certain devlink set, it can't be sure that the device it's currently processing actually owns this symlink.

In the case of multipath, SYSTEMD_READY="0" will be set for all devices except the winning device mapper device, which also has the highest priority. The same holds for MD RAID. IIUC this should cause the symlink to be set correctly in systemd. There's nothing that would enforce this behavior (setting SYSTEMD_READY="0" for all but the highest prio candidate) in udev rules, though.

In order to be certain to have the "right" device mapped to any symlink pseudo-device, systemd would need to read the symlinks that udev creates. The udev db itself doesn't contain this information, unless there's exactly one device with higher link prio than all others.

Alternatively, one might argue that if several devices claim the same symlink with the same prio, it doesn't really matter which one wins, and except for very rare situations it wouldn't really hurt that systemd and udev are in inconsistent state (if it did hurt, we'd see a lot more errors in this area than we actually do). If we accept this argument, it'd be sufficient if systemd read the link priority and mapped symlinks to the device with the highest prio, without having to actually read the symlinks.

@yuwata
Copy link
Member Author

yuwata commented Apr 29, 2022

Ah, indeed. Thanks. Will update.

Can you explain how this relates to my #23215 ?

The last commit is mostly equivalent to your PR, no?

@mwilck
Copy link
Contributor

mwilck commented Apr 29, 2022

The last commit is mostly equivalent to your PR, no?

Right ... I didn't realize this looking over the commit messages. Why didn't you just use my patch?

@mwilck
Copy link
Contributor

mwilck commented Apr 29, 2022

Right ... I didn't realize this looking over the commit messages. Why didn't you just use my patch?

I see you moved the flag modification from device_deserialize_item() to device_coldplug(). I suppose that's ok, but I'd prefer fixing the flags as soon as possible, thereby avoiding any risk to use the "wrong" flags in case the ordering of function calls in this part of the code changes in the future. Could you explain why you think doing it in device_coldplug() is better?

@fbuihuu
Copy link
Contributor

fbuihuu commented Apr 29, 2022

The last commit is mostly equivalent to your PR, no?

Right ... I didn't realize this looking over the commit messages. Why didn't you just use my patch?

@yuwata that doesn't sound really fair actually. As you can guess, Martin spent a lot o time investigating, analyzing and fixing the issues (introduced roughly 4 years ago by 66f3fdb that no one could fix since then). So at minimum it would be nice to give him credit of the detailed analyzis although I'm not sure to understand why you didn't provide feedback on his PR in the first place, that's what we usually do.

@yuwata
Copy link
Member Author

yuwata commented Apr 29, 2022

@yuwata that doesn't sound really fair actually. As you can guess, Martin spent a lot o time investigating, analyzing and fixing the issues (introduced roughly 4 years ago by 66f3fdb that no one could fix since then). So at minimum it would be nice to give him credit of the detailed analyzis although I'm not sure to understand why you didn't provide feedback on his PR in the first place, that's what we usually do.

Changing the author of the last commit to @mwilck is OK for me. I do not intend to steal his work.

@mwilck I'm sorry if you feel uncomfortable about the commit. Which way do you like; changing the author of the last commit to you and discuss further in this PR, or dropping the last commit from this PR and discuss about the change in your PR? Both are OK for me.

I see you moved the flag modification from device_deserialize_item() to device_coldplug(). I suppose that's ok, but I'd prefer fixing the flags as soon as possible, thereby avoiding any risk to use the "wrong" flags in case the ordering of function calls in this part of the code changes in the future. Could you explain why you think doing it in device_coldplug() is better?

In general, serialization/deserialization should be one-to-one. We should not modify values in the deserialization process, especially when it is conditionalized with the other parameter (in this case honor_device_enumeration flag) which is also serialized and deserialized. Of course, the flag is deserialized before deserializing units. So, at least currently, modifying the parameters in device_deserialize() and device_coldplug() are equivalent. But, considering possible future code changes, it is better to modify parameters when they are used in general, I think.

@yuwata
Copy link
Member Author

yuwata commented Apr 29, 2022

The code in link_update() tries to set the symlink to the device with the highest link priority. If priorities are equal, it prefers itself (the device for which it's executing).

Right.

We know that this may result in races between different devices claiming the same symlink.

The race should be solved by #23043. If you'd like, please take a look.

If one device has higher link priority than all others, it will win. Regardless of which device "wins", all possible candidates will have the symlink listed in their DEVLINKS property.

Right.

Thus if systemd processes an uevent with a certain devlink set, it can't be sure that the device it's currently processing actually owns this symlink.

It CAN, by using sd_device_new_from_devname().

In the case of multipath, SYSTEMD_READY="0" will be set for all devices except the winning device mapper device, which also has the highest priority. The same holds for MD RAID. IIUC this should cause the symlink to be set correctly in systemd. There's nothing that would enforce this behavior (setting SYSTEMD_READY="0" for all but the highest prio candidate) in udev rules, though.

Right. I forgot that in the previous version. Now, device_is_ready(), which checks SYSTEMD_READY=, is called after sd_device_new_from_devname().

In order to be certain to have the "right" device mapped to any symlink pseudo-device, systemd would need to read the symlinks that udev creates. The udev db itself doesn't contain this information, unless there's exactly one device with higher link prio than all others.

Right. So, sd_device_new_from_devname() reads the symlink.

Alternatively, one might argue that if several devices claim the same symlink with the same prio, it doesn't really matter which one wins, and except for very rare situations it wouldn't really hurt that systemd and udev are in inconsistent state (if it did hurt, we'd see a lot more errors in this area than we actually do). If we accept this argument, it'd be sufficient if systemd read the link priority and mapped symlinks to the device with the highest prio, without having to actually read the symlinks.

I am not sure. May be true, but may not.

@systemd systemd deleted a comment from lgtm-com bot Apr 29, 2022
@systemd systemd deleted a comment from lgtm-com bot Apr 29, 2022
@yuwata yuwata force-pushed the core-device branch 2 times, most recently from acb65eb to d6121f5 Compare April 29, 2022 19:29
@mwilck
Copy link
Contributor

mwilck commented Apr 29, 2022

The race should be solved by #23043. If you'd like, please take a look.

Interesting that you did it this way now... the basic idea looks similar to my ancient PRs #8667 and #9551, which have later been superseded by the "lockless" solution #17431. Back then, I had scaling issues with a very large number of symlink contenders (1000 and more) for a symlink. This was mostly due to bad scaling in the kernel, solved by fd7732e033e3 ("fs/locks: create a tree of dependent requests.") in kernel v5.0.

Right. So, sd_device_new_from_devname() reads the symlink.

Alright. Good to know. Your patch set here is pretty big, I still have to dig deeper into it. Next week, probably.

@mwilck
Copy link
Contributor

mwilck commented Apr 29, 2022

Thus if systemd processes an uevent with a certain devlink set, it can't be sure that the device it's currently processing actually owns this symlink.

It CAN, by using sd_device_new_from_devname().

I see. But now it can happen that when systemd processes an event for device A and reads the devlinks, the links point to device B. Different devlinks can even point to different devices C, D, … Not saying this is a problem, we just need to be aware of the possibility.

@yuwata

This comment was marked as outdated.

@yuwata yuwata added the ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR label Apr 30, 2022
@systemd systemd deleted a comment from lgtm-com bot Apr 30, 2022
@fbuihuu
Copy link
Contributor

fbuihuu commented May 18, 2022

The macro MANAGER_IS_SWITCHING_ROOT() is true only when the previous process in the shutting down stage. Hence, it is always false in device_coldplug() or device_catchup().

@yuwata I don't think so, otherwise this code would have never been able to pass my basic testing. See the call to manager_set_switching_root(m, arg_switched_root); in main().

The fourth commit is more interesting because it drops device_coldplug() entirely. Since this function was basically used to restore the serialized device properties, which is supposed to be done be device_deserialize_items(). Hence it let device_catchup() decides for the final state of the device and generate the state transition event when needed.

If the (deserialized) found and the enumerated_found are equivalent, then it does not call device_set_state()...

Well that's the whole point of the patch: "forget" all devices after switching root that are no more referenced by udev nor by systemd (mount/swap) units.

I think you should reread it more carefully. This version seems to be simpler, which is rather a good argument since the logic is already complicated.

Thanks.

@yuwata
Copy link
Member Author

yuwata commented May 18, 2022

@fbuihuu Could you open a PR? Then we can easily review and discuss about the code there.

@fbuihuu
Copy link
Contributor

fbuihuu commented May 18, 2022

@yuwata there seems to be a regression with your PR and OPTIONS+="db_persist". With such option a device is seen as dead after swittching root and hence generates a wrong device state transition dead->plugged.

@fbuihuu
Copy link
Contributor

fbuihuu commented May 18, 2022

@fbuihuu Could you open a PR? Then we can easily review and discuss about the code there.

OK I will.

@yuwata
Copy link
Member Author

yuwata commented May 18, 2022

@yuwata there seems to be a regression with your PR and OPTIONS+="db_persist". With such option a device is seen as dead after swittching root and hence generates a wrong device state transition dead->plugged.

If the device is not mounted, then such transition should occur. But, does it cause any issues?

@fbuihuu
Copy link
Contributor

fbuihuu commented May 18, 2022

It's not clear to me whether it will cause issue in practice (maybe @mwilck knows ?) but this dead->plugged transition for devices with db_persist doen't look correct and wasn't present before. And udev man page seems to indicate that this transition should not occur.

@mwilck
Copy link
Contributor

mwilck commented May 18, 2022

It's not clear to me whether it will cause issue in practice (maybe @mwilck knows ?) but this dead->plugged transition for devices with db_persist doen't look correct and wasn't present before. And udev man page seems to indicate that this transition should not occur.

Why would this even be mentioned in an the udev man page? It's genuine systemd functionality.

I have observed these transitions (I see them all the time actually because I work with multipath-tools, but I didn't bother about it), but definitely no negative impact.

I have a test case here where root FS (btrfs) is on top of LVM on top of MD RAID on top of multipath. Thus the multipath devices (which have db_persist set) are in use once root is mounted, but from the PoV of systemd, they have neither DEVICE_FOUND_MOUNT nor DEVICE_FOUND_SWAP set. Thus they start out "dead" after switching root, just like in @fbuihuu's example. I can see these dead→plugged transitions with loglevel debug, but everything works just fine during boot.

@mwilck
Copy link
Contributor

mwilck commented May 18, 2022

@yuwata. recall this discussion we had previously? If we removed the lines

+                if (found == DEVICE_NOT_FOUND)
+                        state = DEVICE_DEAD; /* If nobody sees the device, downgrade more */

from 75d7b59, we'd still see transition messages but only "tentative→plugged", not the confusing "dead→plugged". I don't think this can be completely avoided. We apply the result from enumeration after the result from deserialization on purpose.

@mwilck
Copy link
Contributor

mwilck commented May 18, 2022

Ah, and my example above shows that it can actually happen that deserialized_state == DEVICE_PLUGGED and (deserialized_found & ~DEVICE_FOUND_UDEV) == DEVICE_NOT_FOUND. (#23218 (comment)).

fbuihuu added a commit to fbuihuu/systemd that referenced this pull request May 19, 2022
This is a slightly different approach than the one taken by commit
75d7b59 to fix issue systemd#12953 and systemd#23208.

This patch forces PID1 to forget all devices (except those with the
"db_persist" option see below) that were known by PID1 before switching root by
pretending that the devices were in DEAD state before being serialized. Hence
no artificial "plugged->dead" state transitions happen when PID1 is reexecuting
from a switch root followed by "dead->plugged" state transitions when all
devices are coldplugged with the new set of udev rule from the host.

As mentioned previously, devices with the "db_persistent" option are exceptions
of the previously described mechanism. Since these devices remain in the udev
DB even after the DB has been cleared, they still continue to be deserialized
in plugged state and remain in this state hence following the description of
the option. This should fix the regression introduced by
75d7b59.

Fixes: systemd#23429
Replaces: systemd#23218
@keszybz keszybz removed the good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed label May 19, 2022
mrc0mmand added a commit to mrc0mmand/systemd that referenced this pull request May 25, 2022
This should cover cases regarding devices with `OPTIONS+="db_persist"`
during initrd->sysroot transition.

See:
  * systemd#23429
  * systemd#23218
  * https://bugzilla.redhat.com/show_bug.cgi?id=2087225
mrc0mmand added a commit to mrc0mmand/systemd that referenced this pull request May 25, 2022
This should cover cases regarding devices with `OPTIONS+="db_persist"`
during initrd->sysroot transition.

See:
  * systemd#23429
  * systemd#23218
  * systemd#23489
  * https://bugzilla.redhat.com/show_bug.cgi?id=2087225
yuwata pushed a commit to mwilck/systemd-1 that referenced this pull request May 25, 2022
This should cover cases regarding devices with `OPTIONS+="db_persist"`
during initrd->sysroot transition.

See:
  * systemd#23429
  * systemd#23218
  * systemd#23489
  * https://bugzilla.redhat.com/show_bug.cgi?id=2087225
yuwata pushed a commit to yuwata/systemd-stable that referenced this pull request May 26, 2022
This should cover cases regarding devices with `OPTIONS+="db_persist"`
during initrd->sysroot transition.

See:
  * systemd/systemd#23429
  * systemd/systemd#23218
  * systemd/systemd#23489
  * https://bugzilla.redhat.com/show_bug.cgi?id=2087225
(cherry picked from commit 1fb7f8e)
yuwata pushed a commit to systemd/systemd-stable that referenced this pull request May 27, 2022
This should cover cases regarding devices with `OPTIONS+="db_persist"`
during initrd->sysroot transition.

See:
  * systemd/systemd#23429
  * systemd/systemd#23218
  * systemd/systemd#23489
  * https://bugzilla.redhat.com/show_bug.cgi?id=2087225
(cherry picked from commit 1fb7f8e)
@yuwata
Copy link
Member Author

yuwata commented Aug 12, 2022

#13775 (comment) requests to backport 75d7b59. Setting the backport label.

@keszybz
Copy link
Member

keszybz commented Oct 1, 2022

@yuwata I know this and #23517 have been requested for backports, but I don't feel confident enough to do it myself. Could you open a PR against v250-stable?

@yuwata
Copy link
Member Author

yuwata commented Oct 1, 2022

Sure, I will.

@keszybz
Copy link
Member

keszybz commented Nov 3, 2022

@yuwata ping

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 participants