[RFC] Rework verity_partition() #30075
base: main
Conversation
An -rc1 tag has been created and a release is being prepared, so please note that PRs introducing new features and APIs will be held back until the new version has been released.
As already said, it is not acceptable to make independent units' activation processes block on each other. This is completely unnecessary; it serves no purpose other than implementing a micro-optimization to avoid creating a single symlink.
Please stop abusing this "micro-perf" argument. I know that it was mentioned in #29691, but I can't believe at this point that you didn't get the main point Martin and I were trying to make after so many comments. So, just in case, once again: any code relying on the order of the creation of the symlinks is broken. Looking at the implementation of verity_partition(), it's very likely that this code is (still) subject to races. And in order to make progress we were asked to propose an alternative to the current implementation, so here is one, and it's pretty simple: it takes a lock per image before calling libcryptsetup, which apparently doesn't support concurrent accesses to the same image. Apparently your concern with this implementation is about performance. I'm not sure I see how a busy loop sleeping for an arbitrary amount of time, while several services are contending for the same image, can honestly be faster; quite the opposite, if you ask me. This unfortunately reminds me of a similar story we had in the past with udev's symlink handling (yes, those symlinks again) that was eventually fixed by using proper locking. Can you please give some details about the test cases you're worried about? Is it the one where multiple services activate the same verity device at boot, or is it something more involved? Have you already run any benchmarks that you could share?
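To make the proposal concrete, here is a minimal sketch of the per-image lock idea. This is illustrative only, not the PR's actual code: the helper name and the choice of taking a BSD lock on the image file itself (rather than, say, a lock file under /run) are assumptions.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Take an exclusive BSD file lock on the backing image before any
 * libcryptsetup call, so that only one process at a time can set up,
 * reuse, or tear down a verity device backed by that image.
 * Closing the returned fd releases the lock. */
static int lock_backing_image(const char *image_path) {
        int fd = open(image_path, O_RDONLY|O_CLOEXEC|O_NOCTTY);
        if (fd < 0)
                return -errno;

        if (flock(fd, LOCK_EX) < 0) {  /* blocks until the lock is free */
                int r = -errno;
                close(fd);
                return r;
        }

        return fd;
}
```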
Force-pushed from badddb4 to e8f7037
@mrc0mmand could you please run your reproducer once again to check whether this PR fixes #28141? It wouldn't make much sense to continue the discussion if the problem still exists, although I did try several times with both the old code and this one and failed to reproduce it so far.
I agree with @fbuihuu. Taking a lock on the image is a more reasonable thing to do here. Any objection should be backed by real benchmark data, at least. Besides, we arguably introduced some "performance regression" with systemd-executor too, but for a good reason. The same applies here, as we can stop relying on undefined behavior of udev.
So far it looks like it fails even without the reproducer, see the two CentOS CI jobs:
No, we are not adding any locking between units, that is absolutely a non-starter, especially for some nonsensical performance micro-optimization. It's just a single symlink ffs; even with some of those absurd use cases that are driving all these performance optimizations, with a thousand disks, it can't be that bad. This nonsense broke my production use case completely twice already; there's not going to be a third time. The CI is all red already. Things are back to working just fine right now, and it wasn't cheap; we are not going to re-add all those regressions because some people don't like creating symlinks.
I don't really grok all the stuff involved, but just from the "CI guy" perspective: before 7ec5ce5 and 4ef83d9 were introduced, TEST-50-DISSECT would regularly fail several times a day, and had done so for a long time. After the two commits were merged it hasn't failed once (at least not without a reason). So, again, from the CI perspective, even though there is apparently still some race, the current status quo seems pretty stable to me.
Changing or improving the logic of dissect is OK if it works and if it does not introduce any significant slowdown. But, at the very least, the current form of the change should not work, as the issue occurs when activating a verity volume while another volume with the same backing image file is being deactivated. The issue is NOT in activating multiple verity volumes at the same time.
Yes, if it works. But the benchmark data should be provided by the proposer of the change.
systemd-executor was NOT introduced to solve a regression. I do not think the two situations are comparable.
Unfortunately, no. Reverting my fix does not make the owner of a symlink well defined, as udevd processes multiple uevents at the same time. There is still no guarantee which device owns a symlink when multiple devices request the same one. I cannot understand why @fbuihuu and @mwilck want to change the logic even though we cannot provide any guarantee about the owning device.
There will never be such a guarantee. Your modification only fixes one specific use case. In #29691, we provided several good reasons why we think that in the general case, the current owner of any given symlink should not be swapped for another device, unless a) that other device has a higher …
And yet everything worked fine for the longest time before your performance micro-optimization broke it, and everything worked fine again after that breakage was reverted. Meanwhile, this PR breaks it again. But I guess there are "several good reasons" written in some comments somewhere, so that justifies breaking stuff for other people, right?
Use proper locking instead of trying to handle all possible races that can happen in the middle of the process. Furthermore, a busy loop that sleeps for an arbitrary amount of time can't be faster in the case where multiple services want to activate the same verity image at the same time.

While the lock is held we can assume that only one process can activate or reuse an image at a given time. Hence there's no need to retry an arbitrary number of times when something fishy happens. This also makes the code independent of the way libcryptsetup/device-mapper handles concurrent access to dm devices.

Deferred remove is now exclusively canceled and restored in verity_partition(), so no other service can restore deferred remove while another service is processing the same device. IOW, this patch also removes the possible source of races that could happen with automatic deferred removes done by the kernel.

Last but not least, we no longer access the verity devices via udev's symlinks; the device nodes "/dev/dm-X" are used instead. Hence there's no more need to wait for udev to create the symlink when a device is activated.
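A rough sketch of the serialized flow this describes. Assumptions: lock_backing_image() is the hypothetical helper sketched earlier in the thread, and cancel_deferred_remove() stands in for whatever internal mechanism cancels a pending deferred removal; neither is a real libcryptsetup API, and the PR's actual code may be structured quite differently.

```c
#include <errno.h>
#include <unistd.h>
#include <libcryptsetup.h>

/* Hypothetical helpers, see the note above. */
extern int lock_backing_image(const char *image_path);
extern int cancel_deferred_remove(const char *name);

static int verity_activate_locked(const char *image, const char *name,
                                  const char *root_hash, size_t root_hash_size) {
        struct crypt_device *cd;
        int lock_fd, r;

        /* Serialize all activation/reuse/teardown for this image. */
        lock_fd = lock_backing_image(image);
        if (lock_fd < 0)
                return lock_fd;

        r = crypt_init(&cd, image);
        if (r < 0) {
                close(lock_fd);
                return r;
        }

        r = crypt_load(cd, CRYPT_VERITY, NULL);
        if (r >= 0) {
                r = crypt_activate_by_volume_key(cd, name, root_hash, root_hash_size,
                                                 CRYPT_ACTIVATE_READONLY);
                if (r == -EEXIST)
                        /* The device already exists. While we hold the lock,
                         * nobody else can re-arm deferred removal, so
                         * cancelling it and reusing the device (via /dev/dm-X,
                         * not a udev symlink) is race-free. */
                        r = cancel_deferred_remove(name);
        }

        crypt_free(cd);
        close(lock_fd);  /* releases the per-image lock */
        return r;
}
```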
When a signature is provided but the root hash of the existing device hasn't been marked as validated, we don't really know whether the existing device was validated by userspace or whether signing was disabled. But it actually doesn't matter: reusing the verity device is fine as long as we validate the root hash this time, and it's faster than requesting another dm device that would be identical, with only a different name.
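The reuse decision boils down to something like the following sketch, where verify_root_hash_signature_userspace() is a hypothetical stand-in for the userspace signature validation, not an existing API:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in: returns 0 if `signature` validly signs `root_hash`. */
extern int verify_root_hash_signature_userspace(const void *root_hash, size_t root_hash_size,
                                                const void *signature, size_t signature_size);

/* Decide whether an already-active verity device can be reused. */
static int check_reusable(bool root_hash_validated,
                          const void *root_hash, size_t root_hash_size,
                          const void *signature, size_t signature_size) {
        /* A signature was supplied, but the existing device was never marked
         * as validated: we can't tell whether it was validated in userspace
         * or set up with signing disabled.  Validating the root hash now
         * makes reuse safe, and is cheaper than creating a second, identical
         * dm device under a different name. */
        if (signature && !root_hash_validated)
                return verify_root_hash_signature_userspace(root_hash, root_hash_size,
                                                            signature, signature_size);

        return 0;  /* safe to reuse as-is */
}
```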
Force-pushed from e8f7037 to 64b64ca
That's not how reliable projects are supposed to work, IMHO. In this case we believe that a race is the source of the regression spotted in TEST-50-DISSECT, and the answer to that race is a hack. The actual source of the race has not been understood. And most honest people looking at the related code in systemd (…) I can understand your frustration at seeing this part of the code changed again, but please understand that the fact that TEST-50-DISSECT has been unstable until now just confirms that this function is fragile and must be reworked. I'm pretty sure that if you dig into the history of the related code, you'll find it has been fixed multiple times. Back to this PR: it should fix the problem.
@bluca Please calm down... @fbuihuu
The first step would be to verify that this PR fixes the regression seen in TEST-50-DISSECT. @mrc0mmand just in case you fled the place due to the nice atmosphere, it would still be nice if you could run your reproducer.
The order of the symlinks is now irrelevant at this point, since the logic that sets the verity devices up is likely the problem. It's your choice to keep the workaround in udev. As you already know, I disagree, but it doesn't hurt. For the moment I'm keeping it reverted, to verify that the regression is no longer reproducible with this PR.
Ah, sorry, just forgot to report back - I had it running for 4 hours in a loop and it didn't crash once, so it looks promising. I'll run it once more, just to be sure. |
@yuwata I just realized that my previous comments might have caused some misunderstanding. Like you said, we shouldn't merge the PR as-is: the two reverts of your PR should be dropped if this is to be merged. Assuming we really do want to change the behavior of udev, we'd at least need something in NEWS announcing it as a future incompatible change, and we'd change it only after several releases, as we always do. The reverts are needed during the development of this PR, however, to properly test its reliability through CI. So I'm not supporting reverting your fixes, at least not in the current form. Sorry for the misunderstanding.
@bluca Please try to calm down. We're not trying to break anything here. And we shall not merge this until the CI passes consistently and the performance measurements have been taken into account.
OK. Then let's make this a draft and set the dont-merge tag, so that it doesn't get merged accidentally.
So, I have not read the details of the change yet, but if it really works, please provide benchmark data.
I finally managed to run some benchmarks. They are pretty basic and consist of measuring the execution time of transient services running in parallel, all started at the same time. The results are pretty unexpected to me: I could have expected some differences, but not of this magnitude. So it would be nice if someone could look at the script and run it on their own setup. The performance measurements were done on a VM running Tumbleweed with 4 vCPUs and 8 GiB of RAM. Each column shows the result for a given number of services running in parallel. For a given number of services, the test was run ten times. Timings are expressed in ms.

Results for the old code
Results for this PR
You might wonder why the old code was tested only with a maximum of 15 services. Well, because it broke... I suspect that the last patch helps a lot, but it needs to be reviewed as I'm not sure about it. So I suppose the next step is to run the performance measurements again without the validation of the root hash, but that will be for tomorrow.
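Since the script itself didn't survive in this thread, here is a rough sketch of the kind of harness described above. The RootImage= path and the payload command are placeholders, and the real script may well have looked different:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
        int n = argc > 1 ? atoi(argv[1]) : 10;  /* number of parallel services */
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);

        /* Start n transient services that all mount the same verity image. */
        for (int i = 0; i < n; i++) {
                pid_t pid = fork();
                if (pid == 0) {
                        execlp("systemd-run", "systemd-run",
                               "--wait", "--quiet", "--collect",
                               "-p", "RootImage=/path/to/image.raw",  /* placeholder image */
                               "/usr/bin/true",                       /* placeholder payload */
                               (char *) NULL);
                        _exit(127);
                }
        }

        /* Reap all children, i.e. wait until every service has finished. */
        while (wait(NULL) > 0)
                ;

        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("%d services: %ld ms\n", n,
               (long) ((t1.tv_sec - t0.tv_sec) * 1000 +
                       (t1.tv_nsec - t0.tv_nsec) / 1000000));
        return 0;
}
```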
You really need to learn to take "no" for an answer. I couldn't possibly be any clearer: this is not going to be merged, in this form or in any other, regardless of whatever you come up with, as your performance micro-optimizations have done enough damage already.
How did it even reach the production use case twice? It looks almost like all the changes are blindly taken from GitHub with no testing whatsoever. Maybe that sort of sloppy craftsmanship should be addressed to prevent anything like that from happening again.
Why should anyone care?
The second round finished with the same result: still no failure after another 4 hours.
Here are the performance measurements without root hash validation, on the same VM. The old code broke again when the number of services run in parallel is greater than 20, but at least I got some figures to compare with this PR.

Results for the old code (commit 48bfc67)
Results for this PR
Result comparison
Again, the results speak for themselves: this PR is at least 40% faster on this system when the number of services run in parallel is >= 25. Let me know if you need other benchmark results.
Does this change to dissect.c work even without the revert for udev? Please confirm that.
Yes, the new logic doesn't depend on the behavior of udevd (whatever it is), since it actually fixes all the races that can happen in the old code, and is thus the genuine fix for #28141. And yes, I tested it.
I am pretty sure you haven't actually tested anything, not for real anyway, as if you had done any real-world usage of dm-verity you would know that it is not the private property of systemd: it is also abundantly used via scripts, veritysetup, libcryptsetup, etc., concurrently, obviously. So a private systemd-specific lock, other than being a truly bad idea, cannot possibly work, and re-introduces all the problems (the real ones) that were slowly and painstakingly fixed, all over the place. As anybody who actually uses dm-verity knows, the right place to solve concurrency issues is in the kernel. This is a bad idea, with a broken implementation, that breaks a bunch of real-world use cases, all in an attempt to fix a non-existing problem.
@poettering, I think it's time to step in. Perhaps you'll be able to explain the presence of this broken (and slow) code in systemd. If that's the case and you're aware of any real use case that could justify its presence, please enlighten us. Discussions in issue #28141 are also pretty insane in several respects, but more importantly, closing the issue by arguing that systemd can rely on behavior that is clearly documented as undefined (most likely since the beginning) in order to work around the racy code looks weird, at the very least. Of course the races are still present, and the test case provided in this PR showed that the current code still fails when 15 services activate the same verity image in parallel, even with the old "undefined" behavior restored. It seems that the whole mess can simply be fixed by a locking approach: it passes all (known) tests, it is faster, and it is much simpler. PTAL.
@poettering @yuwata ping... |
Let's see what CI says...