
udev: make algorithm that selects highest priority devlink less susceptible to race conditions #17431

Merged: 22 commits into systemd:master from udev-link-update-race, Nov 10, 2020

Conversation

msekletar (Contributor)

No description provided.

@msekletar added the udev label Oct 23, 2020
@msekletar (Contributor, Author) commented Oct 23, 2020

@mwilck I rebased your udev-test.pl work on top of the current master and implemented a different algorithm to avoid the race condition. Also, thanks to @poettering for all the suggestions!

@poettering (Member) left a comment


I love where this is going (obviously), some comments though

Review threads (outdated, resolved) on src/udev/udev-node.c and src/udev/udev-event.c.
@poettering added the reviewed/needs-rework 🔨 label Oct 23, 2020
@msekletar force-pushed the udev-link-update-race branch 3 times, most recently from c02230d to f9e765c, November 2, 2020 10:08
@msekletar (Contributor, Author) commented:

Updated. @poettering PTAL.

As agreed offline, we now run the algorithm (update_devnode()) twice. The first run may not be successful, but we will correct the symlinks momentarily, because the second run happens after the udev db is fully updated.
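
In sketch form (update_devnode() is from the comment above; the surrounding driver function and the device_update_db() naming are assumptions for illustration, not the verbatim udev sources):

        /* Hypothetical driver for the two-pass scheme described above. */
        static int process_device(UdevEvent *event) {
                int r;

                /* First pass: the udev db may not yet list every contender,
                 * so the chosen symlink target can be wrong. */
                (void) update_devnode(event);

                /* Persist the device state so other workers (and the second
                 * pass) see this contender. */
                r = device_update_db(event->dev);
                if (r < 0)
                        return r;

                /* Second pass: with the db fully updated, a wrong first pick
                 * is corrected momentarily. */
                return update_devnode(event);
        }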

Allow testing cases where multiple devices are added and removed.
This implies a change of the data structure: every test allows
for multiple devices to be added, and "exp_name" etc. are now properties
of the device, not of the test.
It's not necessary to write the rules for every udev run, as we
may now have many (rather than just 2) per test.

Allow testing cases where multiple devices are added and removed
simultaneously. Tests are started as synchronously as possible using a
semaphore, in order to test possible race conditions. If this isn't
desired, the test parameter "sleep_us" can be set to the number of
microseconds to wait between udev invocations.
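
The semaphore-gated start described in that commit, sketched in C for illustration (the actual harness is Perl; this only shows the synchronization pattern):

        #include <semaphore.h>
        #include <stdlib.h>
        #include <sys/mman.h>
        #include <sys/wait.h>
        #include <unistd.h>

        int main(void) {
                enum { WORKERS = 4 };

                /* Put the semaphore in anonymous shared memory so it is
                 * shared across fork(). */
                sem_t *gate = mmap(NULL, sizeof *gate, PROT_READ | PROT_WRITE,
                                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
                if (gate == MAP_FAILED)
                        return EXIT_FAILURE;
                if (sem_init(gate, /* pshared= */ 1, /* value= */ 0) < 0)
                        return EXIT_FAILURE;

                for (int i = 0; i < WORKERS; i++)
                        if (fork() == 0) {
                                sem_wait(gate);  /* block until all workers exist */
                                /* ... run one udev invocation here ... */
                                _exit(EXIT_SUCCESS);
                        }

                for (int i = 0; i < WORKERS; i++)
                        sem_post(gate);          /* release everyone at once */

                while (wait(NULL) > 0)
                        ;
                return EXIT_SUCCESS;
        }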
More often than not, the created devnode is the basename of the
sysfs entry. The "devnode" device may be used to override the
auto-detected node name.

Permissions and major/minor number are now verified on the devnode
itself, not on symlinks.

For those tests where exp_name is set to the computed devnode name,
the explicit "exp_name" can be removed. "exp_name" is only required for
symlinks.

This allows separate testing for devnodes and symlinks in a follow-up
patch.
Test if symlinks are created correctly by comparing the symlink
targets to the devnode path. This implies (for the symlink) that
major/minor numbers and permissions are correct, as we have tested
that on the devnode already.
Instead of testing the existence or non-existence of just a single
symlink, allow testing of several links per device.

Change the test definitions accordingly.
udev hasn't supported renaming device nodes for some time.

The "last_rule" option hasn't been supported for some time.
Therefore this test fails if a "not_exp_links" attribute is added,
as it should. Mark it appropriately.
Add some rules that make it a bit harder to pass, mainly the
non-existence checks.
These rules have survived from an ancient version of the code
and serve no purpose any more.
As we can check multiple links in a single test now, these 3
tests can be merged into one.
As we can test multiple devices and multiple links per device
in one test now, these two tests can be merged into one.
This is helpful to catch possible regressions in the test.
Also, don't count wait() errors, they are likely not udev errors.
Add 4 new tests using multiple devices. Number 2-4 use many
devices claiming the same symlink, where only one device has
a higher priority than the others. They fail sporadically with
the current code, if a race condition causes the symlink to point
to the wrong device. Test 4 is like test 2 with sleeps in between,
it's much less likely to fail.
For easier reproduction of sporadic test failures.
Manually listing all devices in the test definition becomes cumbersome
with lots of devices. Add a function that scans all block devices in
the test sysfs and generates a list of devices to test.
Script for generating LOTS of SCSI disk and partition devices in
the fake sysfs we use for udev testing.

This script is supposed to be run after sys-script.py. It uses code
from sys-script.py as a template to generate additional SCSI disk data
structures. Together with the "generator" code in udev-test.pl
added in the previous patch, it allows running udev tests with almost
arbitrarily many devices, and thus doing performance scaling tests.
umount emits an error message "no mount point specified" if the
tmpfs isn't mounted yet, which is the normal case.
Suppress that by redirecting stderr.
Since 'test/udev-test.pl: count "good" results', we know how many
checks succeeded. Add an "expected good" count to make that number
more meaningful.
@mwilck (Contributor) commented Nov 2, 2020

Hi Michael,

I finally came down to testing your code. First I made sure the issue reproduced with current master (it does, easily). Then I tested your tree with 50 devices (python sd-test.py . 50), and it worked well 👍 . With 200 devices, it also succeeds, but we got scaling issues again.

perf top shows output like this:

+   70,32%     0,00%  test-udev                 [.] link_update                                                                       
-   69,95%     0,11%  test-udev                 [.] link_find_prioritized                                                             
+   47,92%     1,35%  [kernel]                  [k] entry_SYSCALL_64_after_hwframe                                                    
+   47,91%     0,07%  libsystemd-shared-246.so  [.] sd_device_new_from_device_id                                                      
+   47,07%     0,13%  libsystemd-shared-246.so  [.] sd_device_new_from_devnum                                                         
+   46,89%     0,04%  libsystemd-shared-246.so  [.] sd_device_new_from_syspath                                                        
+   46,67%     1,20%  [kernel]                  [k] do_syscall_64                                                                     
+   44,15%     0,12%  libsystemd-shared-246.so  [.] device_set_syspath                                                                

IIRC this is not worse than what I saw previously with my code - hard to compare anyway because the hardware has changed in the meantime.

Anyway, great job!

@mwilck (Contributor) commented Nov 2, 2020

Some numbers from my test system (24 x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 64 GB RAM), from time udev-test.pl 161, all times in seconds:

number of devices    real    user     sys
                1     1.3     5.1     4.5
               10     3.0      24      27
               20     6.5      57      64
               50      24     220     250
              100      71     690     760
              200     300    3000    3300

It scales roughly quadratically with the number of devices. Maybe the change in the directory structure I made in 6b2db24 would improve matters somewhat. Anyway, in practice there are rarely more than ~10 contenders for a symlink.
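
The quadratic behaviour is consistent with the profile above: for every device event, link_find_prioritized() walks the whole per-symlink stack directory and instantiates an sd_device per contender, so n events each do O(n) work. A simplified, self-contained sketch of that pattern (not the verbatim udev-node.c code; devlink_prio() is a stand-in for the internal priority lookup):

        #include <dirent.h>
        #include <errno.h>
        #include <stddef.h>
        #include <systemd/sd-device.h>

        /* Stand-in for the internal device_get_devlink_priority(). */
        static int devlink_prio(sd_device *dev) {
                (void) dev;
                return 0;
        }

        /* Pick the highest-priority contender for one symlink by walking its
         * stack directory under /run/udev/links/. With n contenders this loop
         * is O(n), and it runs once per uevent, hence O(n^2) overall. */
        static int find_prioritized(const char *stackdir, sd_device **ret) {
                DIR *dir = opendir(stackdir);
                sd_device *best = NULL;
                int best_prio = 0;

                if (!dir)
                        return -errno;

                for (struct dirent *de = readdir(dir); de; de = readdir(dir)) {
                        sd_device *dev;

                        if (de->d_name[0] == '.')
                                continue;

                        /* Entry names are device IDs such as "b8:0"; each lookup
                         * is what shows up above as sd_device_new_from_device_id(). */
                        if (sd_device_new_from_device_id(&dev, de->d_name) < 0)
                                continue;

                        if (!best || devlink_prio(dev) > best_prio) {
                                sd_device_unref(best);
                                best_prio = devlink_prio(dev);
                                best = dev;
                        } else
                                sd_device_unref(dev);
                }

                closedir(dir);
                *ret = best;
                return best ? 0 : -ENOENT;
        }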

@poettering added the reviewed/needs-rework 🔨 label Nov 4, 2020
@poettering (Member) commented:

I didn't review the perl stuff; I am not enough of a perl guru for this.

@msekletar (Contributor, Author) commented:

@poettering @keszybz thanks for the reviews so far! PTAL at the newest version.

@msekletar (Contributor, Author) commented:

@keszybz I have no idea why hwdb-test fails. AFAICT, hwdb doesn't use any code that I've touched in this PR.

@msekletar removed the reviewed/needs-rework 🔨 label Nov 5, 2020
Review threads on src/udev/udev-node.c.
@keszybz added the reviewed/needs-rework 🔨 label Nov 5, 2020
@msekletar force-pushed the udev-link-update-race branch 2 times, most recently from 01a7d8a to aaf0ef7, November 5, 2020 20:08
@msekletar (Contributor, Author) commented:

@keszybz updated, PTAL.

I removed the initialisation of i, r, and st1.

Review threads on src/udev/udev-node.c and src/basic/stat-util.c.
Note that the st_mtime member of struct stat is defined as follows:

 #define st_mtime st_mtim.tv_sec

Hence, by checking only st_mtime, we omitted the nanosecond part of the
timestamp (struct timespec) and could miss modifications that happened
within the same second.
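
In code, the idea reads roughly as follows (a minimal sketch; the helper name is illustrative, not necessarily the exact helper added to src/basic/stat-util.c):

        #include <stdbool.h>
        #include <sys/stat.h>

        /* Compare the complete modification timestamp, seconds and
         * nanoseconds, instead of the seconds-only st_mtime alias. */
        static bool mtime_unmodified(const struct stat *a, const struct stat *b) {
                return a->st_mtim.tv_sec == b->st_mtim.tv_sec &&
                       a->st_mtim.tv_nsec == b->st_mtim.tv_nsec;
        }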
@msekletar force-pushed the udev-link-update-race branch 2 times, most recently from 084f427 to 52edc45, November 6, 2020 12:27
udev: make algorithm that selects highest priority devlink less
susceptible to race conditions

Previously it was very likely, when multiple contenders for the symlink
appeared in parallel, that the algorithm would select the wrong symlink
(i.e. one with a lower priority).

Now the algorithm is much more defensive: when we detect a change in the
set of contenders for the symlink, we reevaluate the selection. The same
happens when a new symlink replaces an already existing symlink that
points to a different device node.
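
In outline, the defensive loop looks like the sketch below. This is an illustration under assumptions: the stat()-based change detection of the stack directory and the -ELOOP exit match the behaviour discussed later in this thread, but the retry constant's value and the helper structure are illustrative.

        #include <errno.h>
        #include <sys/stat.h>

        #define LINK_UPDATE_MAX_RETRIES 128  /* name from the follow-up commits; value illustrative */

        /* Re-check the per-symlink stack directory after acting on it, and
         * retry if another contender changed it in the meantime. */
        static int link_update_sketch(const char *stackdir) {
                for (unsigned i = 0; i < LINK_UPDATE_MAX_RETRIES; i++) {
                        struct stat st1 = {}, st2 = {};

                        if (stat(stackdir, &st1) < 0 && errno != ENOENT)
                                return -errno;

                        /* ... select the highest-priority contender and create,
                         * update or remove the symlink accordingly ... */

                        if (stat(stackdir, &st2) < 0 && errno != ENOENT)
                                return -errno;

                        /* Set of contenders unchanged: our selection stands. */
                        if (st1.st_mtim.tv_sec == st2.st_mtim.tv_sec &&
                            st1.st_mtim.tv_nsec == st2.st_mtim.tv_nsec)
                                return 0;

                        /* A contender appeared or vanished: reevaluate. */
                }

                /* Retries exhausted; this surfaces as the -ELOOP ("Too many
                 * levels of symbolic links") reported further below. */
                return -ELOOP;
        }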
@msekletar (Contributor, Author) commented:

@poettering @keszybz updated.

Thank you for all the suggestions and help.

@msekletar removed the reviewed/needs-rework 🔨 label Nov 6, 2020
@keszybz merged commit b8aac50 into systemd:master Nov 10, 2020
@keszybz (Member) commented Nov 10, 2020

@mwilck, @msekletar, good to see this done!

@mwilck (Contributor) commented Feb 1, 2021

Eventually, the customer who originally reported this race, and whose report made me suggest #9551 in the first place, tested this code. Results are mixed, unfortunately. First of all, we see

[   27.101200] dl360g91 systemd-udevd[685]: sdv: Failed to update device symlinks: Too many levels of symbolic links
[   27.132792] dl360g91 systemd-udevd[625]: sdo: Failed to update device symlinks: Too many levels of symbolic links

This message is printed if one path loses the race for the symlink. The message "Too many levels of symbolic links", created from the -ELOOP return value of link_update(), is a bit misleading. Normally it indicates some sort of symbolic-link loop, but here it has a totally different meaning (maximum number of retries exceeded). This also shows that it is possible, and not even very unlikely, to exceed the maximum number of retries.

Note that this message is printed in a situation where only SCSI devices are contending for the same symlink. They all have the same priority. The issue here is, AFAICS, that every contender puts itself first in the list and will only give up if some other device claims a higher priority. It might be better to just use e.g. alphabetical ordering to avoid ping-pong between contenders.
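
For illustration, the deterministic ordering suggested above could look like this; this is an assumption about a possible fix, not code from the PR:

        #include <string.h>

        /* Order contenders by devlink priority first; break ties by name so
         * that every worker converges on the same winner instead of
         * ping-ponging. */
        static int contender_cmp(int prio_a, const char *id_a,
                                 int prio_b, const char *id_b) {
                if (prio_a != prio_b)
                        return prio_a > prio_b ? -1 : 1;  /* higher priority sorts first */
                return strcmp(id_a, id_b);                /* deterministic tie-break */
        }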

msekletar added a commit to msekletar/systemd that referenced this pull request Mar 4, 2021
In PR systemd#17431 we introduced a retry loop in link_update() in order
to maximize the chance that we end up with the correct target when there
are multiple contenders for a given symlink.

The number of iterations in the retry loop is either 1 or
LINK_UPDATE_MAX_RETRIES, depending on the value of the 'initialized' db
flag. When a device appears for the first time, we need to set the flag
before calling link_update() via update_devnode() for the second time,
to make sure we run the second invocation with the higher retry-loop
counter.
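
In sketch form (LINK_UPDATE_MAX_RETRIES and the 'initialized' flag are from the commit message above; the helper and the constant's value are assumed):

        #include <stdbool.h>

        #define LINK_UPDATE_MAX_RETRIES 128  /* illustrative value */

        /* A device not yet marked initialized in the udev db gets a single
         * pass; the full retry budget applies from the second invocation on. */
        static unsigned link_update_retries(bool initialized) {
                return initialized ? LINK_UPDATE_MAX_RETRIES : 1U;
        }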
msekletar added a commit to msekletar/systemd that referenced this pull request Mar 4, 2021
yuwata pushed a commit that referenced this pull request Mar 6, 2021