
[RFC] Rework verity_partition() #30075

Draft
wants to merge 4 commits into
base: main

Conversation

@fbuihuu (Contributor) commented Nov 17, 2023

Let's see what CI says...

@github-actions github-actions bot added udev util-lib please-review PR is ready for (re-)review by a maintainer labels Nov 17, 2023

An -rc1 tag has been created and a release is being prepared, so please note that PRs introducing new features and APIs will be held back until the new version has been released.

@bluca (Member) commented Nov 17, 2023

As already said, it is not acceptable to make the activation of independent units block on each other. This is completely unnecessary; it serves no purpose other than implementing a micro-optimization to avoid creating a single symlink.

@bluca bluca added needs-discussion 🤔 and removed please-review PR is ready for (re-)review by a maintainer labels Nov 17, 2023
@fbuihuu (Contributor, Author) commented Nov 21, 2023

Please stop abusing this "micro-perf" argument. I know that it was mentioned in pr#29691 but I can't believe at this point that you didn't get the main point Martin and I were trying to make after so many comments. So just in case and once again: any code relying on the order of the creation of the symlinks is broken.

Looking at the implementation of verity_partition(), it's very likely that this code is (still) subject to races. In order to make progress we were asked to propose an alternative to the current implementation, so here is one, and it's pretty simple: it takes a lock per image before calling libcryptsetup, which apparently doesn't support concurrent access to the same image.
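
[Editorial note: for illustration, here is a minimal sketch of that per-image locking idea, assuming a hypothetical lock_image() helper and a lock file under /run/lock. It is not the actual PR code, which uses systemd's own helpers.]

/* Minimal sketch of the per-image locking idea, NOT the actual PR code.
 * The helper name and the lock-file naming scheme are hypothetical. */
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/file.h>
#include <unistd.h>

/* Take an exclusive lock tied to the backing image; returns an fd that must
 * stay open for as long as the lock is needed, or -1 on error. */
static int lock_image(const char *image_path) {
        char lock_path[PATH_MAX];
        const char *base = strrchr(image_path, '/');
        int fd;

        /* A real implementation would escape the full path instead of using
         * only the basename, to avoid collisions between images. */
        snprintf(lock_path, sizeof(lock_path), "/run/lock/verity-%s.lock",
                 base ? base + 1 : image_path);

        fd = open(lock_path, O_RDWR | O_CREAT | O_CLOEXEC, 0600);
        if (fd < 0)
                return -1;

        /* Blocks until no other process holds the lock for this image. */
        if (flock(fd, LOCK_EX) < 0) {
                close(fd);
                return -1;
        }

        return fd;
}

static void unlock_image(int fd) {
        if (fd >= 0)
                close(fd); /* closing the fd drops the flock() */
}

With such a lock held, the activate-or-reuse sequence for a given image runs serialized across services, and the flock() is released automatically even if the holder crashes.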

Apparently your concern with this implementation is about performance. Honestly, I don't see how a busy loop that sleeps for an arbitrary amount of time while several services are contending for the same image could be faster; quite the opposite, if you ask me. This unfortunately reminds me of a similar story we had in the past with udev's symlink handling (yes, these symlinks again) that was eventually fixed by using proper locking.

Can you please give some details about the test cases you're worried about? Is it the one where multiple services activate the same verity device at boot, or is it something more involved?

Have you already run any benchmarks that you could share?

@github-actions github-actions bot added the please-review PR is ready for (re-)review by a maintainer label Nov 21, 2023
@fbuihuu (Contributor, Author) commented Nov 21, 2023

@mrc0mmand could you please run your reproducer once again to check whether this PR fixes #28141?

It wouldn't make much sense to continue the discussion if the problem still exists, although I'm pretty sure that verity_partition() should be rewritten anyway.

I tried several times with both the old code and this one, but unfortunately I have failed to reproduce the issue so far.

@YHNdnzj (Member) commented Nov 21, 2023

I agree with @fbuihuu. Taking a lock on the image is a more reasonable thing to do here. Any objection should be based on real benchmark data at least.

Besides, we have arguably introduced some "performance regression" in systemd-executor, but for a good reason. The same applies here, as we can step away from relying on undefined behavior of udev.

@mrc0mmand (Member)

@mrc0mmand could you please run your reproducer once again to check whether this PR fixes #28141?

It wouldn't make much sense to continue the discussion if the problem still exists, although I'm pretty sure that verity_partition() should be rewritten anyway.

I tried several times with both the old code and this one, but unfortunately I have failed to reproduce the issue so far.

So far it looks like it fails even without the reproducer, see the two CentOS CI jobs:

[   25.269726] (cat)[896]: DEBUG: activating verity device 5d08f8b9a9913947c7c632b53cd303116a18c469c92c87623e6b3c0a11fa8b27-verity (free)
[   25.271772] (cat)[896]: device-mapper: reload ioctl on 5d08f8b9a9913947c7c632b53cd303116a18c469c92c87623e6b3c0a11fa8b27-verity (254:0) failed: Required key not available
[   25.445730] kernel: loop1: detected capacity change from 0 to 4992
[   25.351439] (cat)[896]: Trying to reuse existing verity device 5d08f8b9a9913947c7c632b53cd303116a18c469c92c87623e6b3c0a11fa8b27-verity
[   25.353190] (cat)[896]: Failed to check if existing verity device 5d08f8b9a9913947c7c632b53cd303116a18c469c92c87623e6b3c0a11fa8b27-verity can be reused: Invalid argument
[   25.457022] (cat)[896]: run-u17.service: Failed to set up mount namespacing: /run/img2: Invalid argument
[   25.458692] systemd[1]: run-u17.service: Main process exited, code=exited, status=226/NAMESPACE
[   25.458908] systemd[1]: run-u17.service: Failed with result 'exit-code'.

@bluca (Member) commented Nov 21, 2023

I agree with @fbuihuu. Taking a lock on the image is a more reasonable thing to do here. Any objection should be based on real benchmark data at least.

Besides, we have arguably introduced some "performance regression" in systemd-executor, but for a good reason. The same applies here, as we can step away from relying on undefined behavior of udev.

No, we are not adding any locking between units; that is absolutely a non-starter, especially for some nonsensical performance micro-optimization. It's just a single symlink, ffs; even with some of those absurd use cases that are driving all these performance optimizations, with a thousand disks, it can't be that bad. This nonsense already broke my production use case completely twice; there's not going to be a third time. The CI is already all red. Things are back to working just fine right now, and getting there wasn't cheap; we are not going to re-add all those regressions because some people don't like creating symlinks.

@bluca bluca removed the please-review PR is ready for (re-)review by a maintainer label Nov 21, 2023
@mrc0mmand (Member)

I don't really grok all the stuff involved, but just from the "CI guy" perspective - before 7ec5ce5 and 4ef83d9 were introduced, TEST-50-DISSECT would regularly fail several times a day, for a long time. But after the two commits were merged it hasn't failed once (at least not without a reason). So, again - from the CI perspective, even though there is apparently still some race, the current status quo seems pretty stable to me.

@yuwata (Member) commented Nov 21, 2023

Changing or improving the dissect logic is OK if it works and does not introduce any significant slowdown.

But, at least, the current form of the change should not work, as the issue occurs when a verity volume is activated while another volume with the same backing image file is being deactivated. The issue is NOT about activating multiple verity volumes at the same time.

@yuwata (Member) commented Nov 21, 2023

@YHNdnzj

Any objection should be based on real benchmark data at least.

Yes. If it works. But the benchmark data should be provided by the change proposer.

Besides, we have arguably introduced some "performance regression" in systemd-executor, but for a good reason. The same applies here, as we can step away from relying on undefined behavior of udev.

systemd-executor was NOT introduced to solve a regression. I do not think these two situations are comparable.

we can step away from relying on undefined behavior of udev.

Unfortunately, no. Reverting my fix does not make the owner of a symlink well defined, as udevd processes multiple uevents at the same time. Still there is no guarantee which device owns a symlink when multiple devices request the same symlink.

I cannot understand why @fbuihuu and @mwilck want to change the logic even though we cannot provide any guarantee about the owner device.

@mwilck (Contributor) commented Nov 21, 2023

@yuwata,

Still there is no guarantee which device owns a symlink when multiple devices request the same symlink.

There will never be such a guarantee. Your modification only fixes one specific use case.

In #29691, we have provided several good reasons why we think that, in the general case, the current owner of any given symlink should not be swapped for another device, unless a) that other device has a higher link_priority or b) the current owner is in the process of being removed. You actually agreed to that, IIRC.
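
[Editorial note: for illustration only, the rule described above could be expressed roughly like this; the struct and field names are hypothetical, not udev's actual data structures.]

/* Sketch of the symlink-ownership rule argued for above; hypothetical types,
 * not udev's actual implementation. */
struct symlink_claim {
        int link_priority;   /* as set via OPTIONS+="link_priority=N" */
        int being_removed;   /* non-zero if a "remove" uevent is in flight */
};

static int should_take_over_symlink(const struct symlink_claim *current,
                                    const struct symlink_claim *candidate) {
        if (!current)
                return 1;   /* nobody owns the symlink yet */
        if (current->being_removed)
                return 1;   /* case b): the current owner is going away */
        if (candidate->link_priority > current->link_priority)
                return 1;   /* case a): higher link_priority wins */
        return 0;           /* otherwise, keep the current owner */
}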

@bluca (Member) commented Nov 21, 2023

Still there is no guarantee which device owns a symlink when multiple devices request the same symlink.

There will never be such a guarantee. Your modification only fixes one specific use case.

And yet everything worked fine for the longest time before your performance micro-optimization broke it, and everything worked fine again after that breakage was reverted. Meanwhile, this PR breaks it again. But I guess there are "several good reasons" written in some comments somewhere, so that justifies breaking stuff for other people, right?

@yuwata (Member) commented Nov 21, 2023

@mwilck Sorry, I have not followed recent discussion in #29691. I will read and comment later.

Use proper locking instead of trying to handle all the possible races that can happen in the middle of the process. Furthermore, a busy loop that sleeps for an arbitrary amount of time can't be faster in the case where multiple services want to activate the same verity image at the same time.

While the lock is taken we can assume that only one process can activate or reuse an image at a given time. Hence there's no need to retry an arbitrary number of times when something fishy happens. This also makes the code independent of the way libcryptsetup/device-mapper handles concurrent access to dm devices.

Deferred remove is now exclusively canceled and restored in verity_partition(), so no other service can restore a deferred remove while another service is processing the same device. IOW, this patch also removes the possible source of races that could happen with automatic deferred removes done by the kernel.

Last but not least, we don't access the verity devices via udev's symlinks anymore; the device nodes "/dev/dm-X" are used instead. Hence there's no longer any need to wait for udev to create the symlink when the device is activated.
When a signature is provided but the root hash of the existing device hasn't been marked as validated, we don't really know whether the existing device was validated by userspace or whether signing was disabled.

But it actually doesn't matter: reusing the verity device should be fine if we validate the root hash this time, and it's faster than requesting another dm device that would be the same, only with a different name.
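
[Editorial note: as an aside on the /dev/dm-X point above, here is an illustrative sketch, not the PR's actual code, of how a mapping's /dev/dm-N node could be resolved from its device-mapper name by scanning sysfs, without waiting for udev to create the /dev/mapper symlink.]

/* Illustrative only: resolve "/dev/dm-N" for a device-mapper name by reading
 * /sys/block/dm-N/dm/name, instead of waiting for the udev symlink. */
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

static int dm_node_from_name(const char *dm_name, char *node, size_t node_size) {
        DIR *d = opendir("/sys/block");
        struct dirent *de;

        if (!d)
                return -1;

        while ((de = readdir(d))) {
                char path[PATH_MAX], name[256];
                FILE *f;

                if (strncmp(de->d_name, "dm-", 3) != 0)
                        continue;

                /* /sys/block/dm-N/dm/name holds the mapping's name. */
                snprintf(path, sizeof(path), "/sys/block/%s/dm/name", de->d_name);
                f = fopen(path, "r");
                if (!f)
                        continue;

                if (fgets(name, sizeof(name), f)) {
                        name[strcspn(name, "\n")] = '\0';
                        if (strcmp(name, dm_name) == 0) {
                                fclose(f);
                                closedir(d);
                                snprintf(node, node_size, "/dev/%s", de->d_name);
                                return 0;
                        }
                }
                fclose(f);
        }

        closedir(d);
        return -1; /* not found */
}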
@github-actions github-actions bot added the please-review PR is ready for (re-)review by a maintainer label Nov 22, 2023
@fbuihuu (Contributor, Author) commented Nov 22, 2023

So, again - from the CI perspective, even though there is apparently still some race, the current status quo seems pretty stable to me.

That's not how reliable projects are supposed to work, IMHO. In this case we believe that a race is the source of the regression spotted in TEST-50-DISSECT, yet the answer to address this race is a hack. The actual source of the race has not been understood. And most honest people looking at the related code in systemd (verity_partition(), to name it) will tell you that this code is subject to races; it's obvious, and it's obviously fragile.

I can understand your frustration at changing this part of the code again, but please understand that the fact that TEST-50-DISSECT has been unstable until now just confirms that this function is fragile and must be reworked. I'm pretty sure that if you dig into the history of the related code, you'll see it's been fixed multiple times.

Back to this PR: the problem should now be fixed.

@yuwata (Member) commented Nov 22, 2023

@bluca Please calm down...

@fbuihuu
Cleaning up the dissecting logic is OK if it works, and if it gives better stability and/or performance.
But, as I said multiple times, the undocumented behavior is mostly API and may be used outside of systemd. So we cannot revert the fix for udevd, even if the dissecting logic will no longer rely on that behavior.

@fbuihuu (Contributor, Author) commented Nov 22, 2023

@fbuihuu Cleaning up dissecting logic is OK if it works, and if it gives better stability and/or performance.

The first step would be to verify that this PR fixes the regression seen in TEST-50-DISSECT.

@mrc0mmand just in case you fled the place due to the nice atmosphere, it would still be nice if you could run your reproducer.

But, as I said multiple times, the undocumented behavior is mostly API and may be used by outside of systemd. So we cannot revert the fix for udevd, even if dissecting logic will not rely on the behavior.

The order of the symlinks is now irrelevant at this point, since the logic that sets up the verity devices is the likely problem. It's your choice to keep the workaround in udev. As you already know, I disagree, but it doesn't hurt.

For the moment I'm keeping it reverted to verify that the regression is no longer reproducible with this PR.

@fbuihuu fbuihuu reopened this Nov 22, 2023
@github-actions github-actions bot added the please-review PR is ready for (re-)review by a maintainer label Nov 22, 2023
@fbuihuu fbuihuu changed the title [WIP] Rework verity partition [RFC] Rework verity partition Nov 22, 2023
@fbuihuu fbuihuu changed the title [RFC] Rework verity partition [RFC] Rework verity_partition() Nov 22, 2023
@mrc0mmand (Member)

@fbuihuu Cleaning up dissecting logic is OK if it works, and if it gives better stability and/or performance.

The first step would be to verify that this PR fixes the regression seen in TEST-50-DISSECT.

@mrc0mmand just in case you fled the place due to the nice atmosphere, it would still be nice if you could run your reproducer.

Ah, sorry, just forgot to report back - I had it running for 4 hours in a loop and it didn't crash once, so it looks promising. I'll run it once more, just to be sure.

@YHNdnzj (Member) commented Nov 22, 2023

@yuwata I just realized that my previous comments might have caused some misunderstanding.

Like you said, we shouldn't merge the PR as-is - the two reverts of your PR should be dropped before this gets merged. Assuming that we really want to change the behavior of udev, we at least need something in NEWS announcing it as a future incompatible change, and we should change it only after several releases - as we always do. The reverts during the development of this PR, however, are needed to properly test its reliability through CI.

So I'm not supporting reverting your fixes, at least not in their current form. Sorry for the misunderstanding.

@YHNdnzj (Member) commented Nov 22, 2023

@bluca Please try to calm down. We're not trying to break anything here. And we shall not merge it until the CI passes consistently and the performance measurements are taken into account.

@yuwata (Member) commented Nov 22, 2023

@fbuihuu

For the moment I'm keeping it reverted to verify that the regression is no longer reproducible with this PR.

@YHNdnzj

The reverts during the development of this PR, however, are needed to properly test its reliability through CI.

OK. Then let's mark this as a draft and set the dont-merge tag, so that it doesn't get merged accidentally.

@yuwata yuwata marked this pull request as draft November 22, 2023 16:56
@yuwata (Member) commented Nov 22, 2023

So, I have not read the details of the change yet, but if it really works, please provide benchmark data.

@fbuihuu (Contributor, Author) commented Nov 22, 2023

I finally managed to run some benchmarks.

They are pretty basic and consist of measuring the execution time of transient services running in parallel and started at the same time. Like TEST-50-DISSECT does, each transient service mounts the same verity image twice, so the code path that reuses an already active image should be heavily exercised. The signature of the root hash is also validated.

The results are pretty unexpected to me. I could have expected some differences, but not of this magnitude. So it would be nice if someone could look at the script and run it on their own setup.

The performance measurements have been done on a VM running Tumbleweed with 4 vCPUs and 8 GiB of RAM.

Each column shows the results for a given number of services running in parallel. For each number of services the test has been run ten times. Timings are expressed in ms.

Results for the old code

Loop     2 services   5 services   10 services   15 services
1        267          619          794           1123
2        267          443          1142          1131
3        254          625          1097          1262
4        251          631          1001          1076
5        241          429          845           1173
6        239          573          811           1102
7        377          626          1207          1073
8        230          718          747
9        285          675          929
10       299          420          1136
------   ----------   ----------   -----------   -----------
mean     271.000      575.900      970.900       1134.286
sdev     42.872       107.177      167.890       66.009

Results for this PR

2 5 10 15 20 25 30 35 40
162 240 331 454 519 685 813 905 1055
107 256 383 542 579 614 753 906 923
149 244 333 414 559 672 774 922 986
156 241 326 444 611 697 797 909 988
138 206 334 408 579 670 790 923 876
133 211 384 468 597 710 858 844 1006
177 209 320 465 599 590 773 892 966
134 222 342 459 566 626 770 953 1072
139 199 368 425 517 644 821 836 1076
178 227 365 544 559 652 728 912 971
------ --------- --------- --------- --------- --------- --------- --------- --------- ---------
mean 147.300 225.500 348.600 462.300 568.500 656.000 787.700 900.200 991.900
sdev 21.807 19.156 24.084 47.322 31.725 38.108 36.920 35.596 64.124

You might wonder why the old code was tested only with a maximum of 15 services. Well, because it broke...

I suspect that the last patch helps a lot, but it needs to be reviewed as I'm not sure about it.

So I suppose that the next step is to run the performance measurements again without the validation of the root hash, but that will be for tomorrow.

@bluca (Member) commented Nov 22, 2023

You really need to learn to take "no" for an answer. I couldn't possibly be any clearer: this is not going to be merged, in this form or in any other, regardless of whatever you come up with, as your performance micro-optimizations have done enough damage already.

@bluca bluca removed the please-review PR is ready for (re-)review by a maintainer label Nov 22, 2023
@evverx (Member) commented Nov 23, 2023

This nonsense broke my production use case completely twice already, there's not going to be a third one.

How did it even reach the production use case twice? It looks almost like all the changes are blindly taken from GitHub with no testing whatsoever. Maybe that sort of sloppy craftsmanship should be addressed to prevent anything like that from happening again.

I couldn't possibly be any clearer

Why should anyone care?

@mrc0mmand (Member)

@fbuihuu Cleaning up dissecting logic is OK if it works, and if it gives better stability and/or performance.

The first step would be to verify that this PR fixes the regression seen in TEST-50-DISSECT.
@mrc0mmand just in case you fled the place due to the nice atmosphere, it would still be nice if you could run your reproducer.

Ah, sorry, just forgot to report back - I had it running for 4 hours in a loop and it didn't crash once, so it looks promising. I'll run it once more, just to be sure.

The second round finished with the same result: still no failures after another 4 hours.

@fbuihuu (Contributor, Author) commented Nov 23, 2023

So I suppose that the next step is to run the performance measurements again without the validation of the root hash, but that will be for tomorrow.

Here are the performance measurements without root hash validation, on the same VM.

The old code broke again when the number of services run in parallel is greater than 20, but at least I got some figures to compare with this PR.

Results for old code (commit 48bfc67)

2 5 10 15 20 25 30 35 40
150 388 524 808 970 754 1283 1119 2124
145 254 407 591 670 1350 1143 1457 1087
169 192 361 507 551 754 1441 1682 1184
209 243 345 441 716 1247 1372 1177 1156
169 215 487 443 1125 660 908 1439 1556
161 217 340 451 600 1068 1216 1284 1720
135 265 404 775 613 1136 1181 1679 2094
150 249 277 639 662 1205 1133 1543
132 246 516 698 585 failed failed failed failed
124 260 364 705 975 failed failed failed failed
------- ------- ------ ------- ------ ------- ------ ------- ------ ------
mean 154 253 403 606 747 1022 1210 1423 1560

Results for this PR

2 5 10 15 20 25 30 35 40
150 219 402 496 637 686 681 991 1091
180 226 383 447 599 771 900 979 1088
145 196 396 452 586 691 786 915 1159
127 210 337 511 622 756 713 888 1079
154 217 312 471 642 717 838 957 1104
152 232 391 569 582 735 835 929 1073
185 213 385 461 578 680 819 934 1098
163 215 411 467 492 711 908 956 1025
156 170 396 487 550 726 897 994 1052
135 248 407 507 526 703 880 947 1095
------- ------- ------ ------- ------ ------- ------ ------- ------ ------
mean 155 215 382 487 581 718 826 949 1086

Result comparison

2 5 10 15 20 25 30 35 40
48bfc67 154 253 403 606 747 1022 1210 1423 1560
this pr 155 215 382 487 581 718 826 949 1086
----- ----- --------- --------- --------- --------- --------- --------- --------- ---------
rel diff 0% -17% -5% -24% -28% -42% -46% -49% -43%

Again, the results speak for themselves. This PR is at least 40% faster when the number of services run in parallel is >= 25 on this system.

Let me know if you need other benchmark results.

@yuwata (Member) commented Nov 23, 2023

Does this change to dissect.c work even without the revert for udev? Please confirm that.

P.S.
I mean, please confirm that the new logic does not rely on the modified undocumented udevd behavior.

@fbuihuu (Contributor, Author) commented Nov 27, 2023

Does this change to dissect.c work even without the revert for udev? Please confirm that.

P.S. I mean, please confirm that the new logic does not rely on the modified undocumented udevd behavior.

Yes, the new logic doesn't depend on the behavior of udevd (whatever it is), since it actually fixes all the races that can happen in the old code and is thus the genuine fix for #28141.

And yes I tested it.

@bluca (Member) commented Nov 27, 2023

I am pretty sure you haven't actually tested anything, not for real anyway, because if you had any real-world usage of dm-verity you would know that it is not the private property of systemd: it is also used abundantly via scripts, veritysetup, libcryptsetup, etc. - concurrently, obviously. So a private systemd-specific lock, other than being a truly bad idea, cannot possibly work, and it re-introduces all the problems (the real ones) that were slowly and painstakingly fixed, all over the place. As anybody who actually uses dm-verity knows, the right place to solve concurrency issues is in the kernel.

This is a bad idea, with a broken implementation, that breaks a bunch of real-world use cases, all in an attempt to fix a non-existent problem.

@fbuihuu (Contributor, Author) commented Dec 11, 2023

@poettering, I think it's time to step in.

Perhaps you'll be able to explain the presence of this broken (and slow) code in systemd. If that's the case and you're aware of any real use case that could justify its presence, please enlighten us.

The discussions in issue #28141 are also pretty insane in several respects, but more importantly, closing the issue by arguing that systemd can rely on a behavior that is clearly documented as undefined (most likely since the beginning) in order to work around the racy code looks weird, to say the least.

Of course the races are still present, and the test case provided in this PR showed that the current code still fails when 15 services activate the same verity image in parallel, even with the old "undefined" behavior restored.

It seems that the whole mess can simply be fixed by a locking approach: it passes all (known) tests, it is faster and it is much simpler.

PTAL.

@fbuihuu (Contributor, Author) commented Jan 12, 2024

@poettering @yuwata ping...
