
udev: make algorithm that selects highest priority devlink less susceptible to race conditions #17431

Merged: 22 commits into systemd:master from udev-link-update-race, Nov 10, 2020

Conversation

msekletar (Contributor)

No description provided.

@msekletar added the udev label Oct 23, 2020
@msekletar (Contributor, Author) commented Oct 23, 2020

@mwilck I rebased your udev-test.pl work on top of the current master and implemented a different algorithm to avoid the race condition. Also, thanks to @poettering for all the suggestions!

@poettering (Member) left a comment


I love where this is going (obviously), some comments though

Review threads (outdated, resolved) on src/udev/udev-node.c and src/udev/udev-event.c.
@poettering added the reviewed/needs-rework 🔨 label Oct 23, 2020
@msekletar force-pushed the udev-link-update-race branch 3 times, most recently from c02230d to f9e765c, November 2, 2020 10:08
@msekletar (Contributor, Author) commented:

Updated. @poettering PTAL.

As agreed offline, we now run the algorithm (update_devnode()) twice. The first run may not be successful, but we will correct the symlinks momentarily, because the second run happens after the udev db is fully updated.
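
In sketch form (update_devnode() is from the comment above; the surrounding driver function and the device_update_db() naming are assumptions for illustration, not the verbatim udev sources):

        /* Hypothetical driver for the two-pass scheme described above. */
        static int process_device(UdevEvent *event) {
                int r;

                /* First pass: the udev db may not yet list every contender,
                 * so the chosen symlink target can be wrong. */
                (void) update_devnode(event);

                /* Persist the device state so other workers (and the second
                 * pass) see this contender. */
                r = device_update_db(event->dev);
                if (r < 0)
                        return r;

                /* Second pass: with the db fully updated, a wrong first pick
                 * is corrected momentarily. */
                return update_devnode(event);
        }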

Allow testing cases where multiple devices are added and removed.
This implies a change of the data structure: every test allows
for multiple devices to be added, and "exp_name" etc. are now properties
of the device, not of the test.
It's not necessary to write the rules for every udev run, as we
may now have many (rather than just 2) per test.

Allow testing cases where multiple devices are added and removed
simultaneously. Tests are started as synchronously as possible using a
semaphore, in order to test possible race conditions. If this isn't
desired, the test parameter "sleep_us" can be set to the number of
microseconds to wait between udev invocations.
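
The semaphore-gated start described in that commit, sketched in C for illustration (the actual harness is Perl; this only shows the synchronization pattern):

        #include <semaphore.h>
        #include <stdlib.h>
        #include <sys/mman.h>
        #include <sys/wait.h>
        #include <unistd.h>

        int main(void) {
                enum { WORKERS = 4 };

                /* Put the semaphore in anonymous shared memory so it is
                 * shared across fork(). */
                sem_t *gate = mmap(NULL, sizeof *gate, PROT_READ | PROT_WRITE,
                                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
                if (gate == MAP_FAILED)
                        return EXIT_FAILURE;
                if (sem_init(gate, /* pshared= */ 1, /* value= */ 0) < 0)
                        return EXIT_FAILURE;

                for (int i = 0; i < WORKERS; i++)
                        if (fork() == 0) {
                                sem_wait(gate);  /* block until all workers exist */
                                /* ... run one udev invocation here ... */
                                _exit(EXIT_SUCCESS);
                        }

                for (int i = 0; i < WORKERS; i++)
                        sem_post(gate);          /* release everyone at once */

                while (wait(NULL) > 0)
                        ;
                return EXIT_SUCCESS;
        }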
More often than not, the created devnode is the basename of the
sysfs entry. The "devnode" device may be used to override the
auto-detected node name.

Permissions and major/minor number are now verified on the devnode
itself, not on symlinks.

For those tests where exp_name is set to the computed devnode name,
the explicit "exp_name" can be removed. "exp_name" is only required for
symlinks.

This allows separate testing for devnodes and symlinks in a follow-up
patch.
Test if symlinks are created correctly by comparing the symlink
targets to the devnode path. This implies (for the symlink) that
major/minor numbers and permissions are correct, as we have tested
that on the devnode already.
Instead of testing the existence or non-existence of just a single
symlink, allow testing of several links per device.

Change the test definitions accordingly.
udev hasn't supported renaming device nodes for some time.

The "last_rule" option hasn't been supported for some time.
Therefore this test fails if a "not_exp_links" attribute is added,
as it should. Mark it appropriately.
Add some rules that make it a bit harder to pass, mainly the
non-existence checks.
These rules have survived from an ancient version of the code
and serve no purpose any more.
As we can check multiple links in a single test now, these 3
tests can be merged into one.
As we can test multiple devices and multiple links per device
in one test now, these two tests can be merged into one.
This is helpful to catch possible regressions in the test.
Also, don't count wait() errors, they are likely not udev errors.
Add 4 new tests using multiple devices. Number 2-4 use many
devices claiming the same symlink, where only one device has
a higher priority than the others. They fail sporadically with
the current code, if a race condition causes the symlink to point
to the wrong device. Test 4 is like test 2 with sleeps in between,
it's much less likely to fail.
For easier reproduction of sporadic test failures.
Manually listing all devices in the test definition becomes cumbersome
with lots of devices. Add a function that scans all block devices in
the test sysfs and generates a list of devices to test.
Script for generating LOTS of SCSI disk and partition devices in
the fake sysfs we use for udev testing.

This script is supposed to be run after sys-script.py. It uses code
from sys-script.py as a template to generate additional SCSI disk data
structures. Together with the "generator" code in udev-test.pl
added in the previous patch, it allows running udev tests with almost
arbitrarily many devices, and thus doing performance scaling tests.
umount emits an error message "no mount point specified" if the
tmpfs isn't mounted yet, which is the normal case.
Suppress that by redirecting stderr.
Since 'test/udev-test.pl: count "good" results', we know how many
checks succeeded. Add an "expected good" count to make that number
more meaningful.
@mwilck (Contributor) commented Nov 2, 2020

Hi Michael,

I finally came down to testing your code. First I made sure the issue reproduced with current master (it does, easily). Then I tested your tree with 50 devices (python sd-test.py . 50), and it worked well 👍 . With 200 devices, it also succeeds, but we got scaling issues again.

perf top shows output like this:

+   70,32%     0,00%  test-udev                 [.] link_update                                                                       
-   69,95%     0,11%  test-udev                 [.] link_find_prioritized                                                             
+   47,92%     1,35%  [kernel]                  [k] entry_SYSCALL_64_after_hwframe                                                    
+   47,91%     0,07%  libsystemd-shared-246.so  [.] sd_device_new_from_device_id                                                      
+   47,07%     0,13%  libsystemd-shared-246.so  [.] sd_device_new_from_devnum                                                         
+   46,89%     0,04%  libsystemd-shared-246.so  [.] sd_device_new_from_syspath                                                        
+   46,67%     1,20%  [kernel]                  [k] do_syscall_64                                                                     
+   44,15%     0,12%  libsystemd-shared-246.so  [.] device_set_syspath                                                                

IIRC this is not worse than what I saw previously with my code - hard to compare anyway because the hardware has changed in the meantime.

Anyway, great job!

@mwilck (Contributor) commented Nov 2, 2020

Some numbers from my test system (24 x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 64 GB RAM), from time udev-test.pl 161, all times in seconds:

number of devices    real    user     sys
                1     1.3     5.1     4.5
               10     3.0      24      27
               20     6.5      57      64
               50      24     220     250
              100      71     690     760
              200     300    3000    3300

It scales roughly quadratically with the number of devices. Maybe the change in the directory structure I made in 6b2db24 would improve matters somewhat. Anyway, in practice there are rarely more than ~10 contenders for a symlink.
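
The quadratic behaviour is consistent with the profile above: for every device event, link_find_prioritized() walks the whole per-symlink stack directory and instantiates an sd_device per contender, so n events each do O(n) work. A simplified, self-contained sketch of that pattern (not the verbatim udev-node.c code; devlink_prio() is a stand-in for the internal priority lookup):

        #include <dirent.h>
        #include <errno.h>
        #include <stddef.h>
        #include <systemd/sd-device.h>

        /* Stand-in for the internal device_get_devlink_priority(). */
        static int devlink_prio(sd_device *dev) {
                (void) dev;
                return 0;
        }

        /* Pick the highest-priority contender for one symlink by walking its
         * stack directory under /run/udev/links/. With n contenders this loop
         * is O(n), and it runs once per uevent, hence O(n^2) overall. */
        static int find_prioritized(const char *stackdir, sd_device **ret) {
                DIR *dir = opendir(stackdir);
                sd_device *best = NULL;
                int best_prio = 0;

                if (!dir)
                        return -errno;

                for (struct dirent *de = readdir(dir); de; de = readdir(dir)) {
                        sd_device *dev;

                        if (de->d_name[0] == '.')
                                continue;

                        /* Entry names are device IDs such as "b8:0"; each lookup
                         * is what shows up above as sd_device_new_from_device_id(). */
                        if (sd_device_new_from_device_id(&dev, de->d_name) < 0)
                                continue;

                        if (!best || devlink_prio(dev) > best_prio) {
                                sd_device_unref(best);
                                best_prio = devlink_prio(dev);
                                best = dev;
                        } else
                                sd_device_unref(dev);
                }

                closedir(dir);
                *ret = best;
                return best ? 0 : -ENOENT;
        }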

@poettering added the reviewed/needs-rework 🔨 label Nov 4, 2020
@poettering (Member) commented:

I didn't review the perl stuff; I am not enough of a perl guru for this.

@msekletar (Contributor, Author) commented:

@poettering @keszybz thanks for the reviews so far! PTAL at the newest version.

@msekletar (Contributor, Author) commented:

@keszybz I have no idea why hwdb-test fails. AFAICT, hwdb doesn't use any code that I've touched in this PR.

@msekletar removed the reviewed/needs-rework 🔨 label Nov 5, 2020
Review threads on src/udev/udev-node.c.
@keszybz added the reviewed/needs-rework 🔨 label Nov 5, 2020
@msekletar force-pushed the udev-link-update-race branch 2 times, most recently from 01a7d8a to aaf0ef7, November 5, 2020 20:08
@msekletar (Contributor, Author) commented:

@keszybz updated, PTAL.

I removed the initialisation of i, r, and st1.

Review threads on src/udev/udev-node.c and src/basic/stat-util.c.
Note that the st_mtime member of struct stat is defined as follows:

 #define st_mtime st_mtim.tv_sec

Hence, by checking only st_mtime, we omitted the nanosecond part of the
timestamp (struct timespec) and could miss modifications that happened
within the same second.
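
In code, the idea reads roughly as follows (a minimal sketch; the helper name is illustrative, not necessarily the exact helper added to src/basic/stat-util.c):

        #include <stdbool.h>
        #include <sys/stat.h>

        /* Compare the complete modification timestamp, seconds and
         * nanoseconds, instead of the seconds-only st_mtime alias. */
        static bool mtime_unmodified(const struct stat *a, const struct stat *b) {
                return a->st_mtim.tv_sec == b->st_mtim.tv_sec &&
                       a->st_mtim.tv_nsec == b->st_mtim.tv_nsec;
        }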
@msekletar force-pushed the udev-link-update-race branch 2 times, most recently from 084f427 to 52edc45, November 6, 2020 12:27
udev: make algorithm that selects highest priority devlink less
susceptible to race conditions

Previously it was very likely, when multiple contenders for the symlink
appeared in parallel, that the algorithm would select the wrong symlink
(i.e. one with a lower priority).

Now the algorithm is much more defensive: when we detect a change in the
set of contenders for the symlink, we reevaluate the selection. The same
happens when a new symlink replaces an already existing symlink that
points to a different device node.
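
In outline, the defensive loop looks like the sketch below. This is an illustration under assumptions: the stat()-based change detection of the stack directory and the -ELOOP exit match the behaviour discussed later in this thread, but the retry constant's value and the helper structure are illustrative.

        #include <errno.h>
        #include <sys/stat.h>

        #define LINK_UPDATE_MAX_RETRIES 128  /* name from the follow-up commits; value illustrative */

        /* Re-check the per-symlink stack directory after acting on it, and
         * retry if another contender changed it in the meantime. */
        static int link_update_sketch(const char *stackdir) {
                for (unsigned i = 0; i < LINK_UPDATE_MAX_RETRIES; i++) {
                        struct stat st1 = {}, st2 = {};

                        if (stat(stackdir, &st1) < 0 && errno != ENOENT)
                                return -errno;

                        /* ... select the highest-priority contender and create,
                         * update or remove the symlink accordingly ... */

                        if (stat(stackdir, &st2) < 0 && errno != ENOENT)
                                return -errno;

                        /* Set of contenders unchanged: our selection stands. */
                        if (st1.st_mtim.tv_sec == st2.st_mtim.tv_sec &&
                            st1.st_mtim.tv_nsec == st2.st_mtim.tv_nsec)
                                return 0;

                        /* A contender appeared or vanished: reevaluate. */
                }

                /* Retries exhausted; this surfaces as the -ELOOP ("Too many
                 * levels of symbolic links") reported further below. */
                return -ELOOP;
        }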
@msekletar (Contributor, Author) commented:

@poettering @keszybz updated.

Thank you for all the suggestions and help.

@msekletar removed the reviewed/needs-rework 🔨 label Nov 6, 2020
@keszybz merged commit b8aac50 into systemd:master Nov 10, 2020
@keszybz (Member) commented Nov 10, 2020

@mwilck, @msekletar, good to see this done!

@mwilck (Contributor) commented Feb 1, 2021

Eventually, the customer who originally reported this race, and whose report made me suggest #9551 in the first place, tested this code. Results are mixed, unfortunately. First of all, we see

[   27.101200] dl360g91 systemd-udevd[685]: sdv: Failed to update device symlinks: Too many levels of symbolic links
[   27.132792] dl360g91 systemd-udevd[625]: sdo: Failed to update device symlinks: Too many levels of symbolic links

This message is printed if one path loses the race for the symlink. The message "Too many levels of symbolic links", created from the -ELOOP return value of link_update(), is a bit misleading. Normally it indicates some sort of symbolic-link loop, but here it has a totally different meaning (maximum number of retries exceeded). This also shows that it is possible, and not even very unlikely, to exceed the maximum number of retries.

Note that this message is printed in a situation where only SCSI devices are contending for the same symlink. They all have the same priority. The issue here is, AFAICS, that every contender puts itself first in the list and will only give up if some other device claims a higher priority. It might be better to just use e.g. alphabetical ordering to avoid ping-pong between contenders.
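
For illustration, the deterministic ordering suggested above could look like this; this is an assumption about a possible fix, not code from the PR:

        #include <string.h>

        /* Order contenders by devlink priority first; break ties by name so
         * that every worker converges on the same winner instead of
         * ping-ponging. */
        static int contender_cmp(int prio_a, const char *id_a,
                                 int prio_b, const char *id_b) {
                if (prio_a != prio_b)
                        return prio_a > prio_b ? -1 : 1;  /* higher priority sorts first */
                return strcmp(id_a, id_b);                /* deterministic tie-break */
        }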

msekletar added a commit to msekletar/systemd that referenced this pull request Mar 4, 2021
In PR systemd#17431 we introduced a retry loop in link_update() in order
to maximize the chance that we end up with the correct target when there
are multiple contenders for a given symlink.

The number of iterations in the retry loop is either 1 or
LINK_UPDATE_MAX_RETRIES, depending on the value of the 'initialized' db
flag. When a device appears for the first time, we need to set the flag
before calling link_update() via update_devnode() for the second time,
to make sure we run the second invocation with the higher retry-loop
counter.
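
In sketch form (LINK_UPDATE_MAX_RETRIES and the 'initialized' flag are from the commit message above; the helper and the constant's value are assumed):

        #include <stdbool.h>

        #define LINK_UPDATE_MAX_RETRIES 128  /* illustrative value */

        /* A device not yet marked initialized in the udev db gets a single
         * pass; the full retry budget applies from the second invocation on. */
        static unsigned link_update_retries(bool initialized) {
                return initialized ? LINK_UPDATE_MAX_RETRIES : 1U;
        }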
msekletar added a commit to msekletar/systemd that referenced this pull request Mar 4, 2021
yuwata pushed a commit that referenced this pull request Mar 6, 2021