RISC-V: mttcg is incompatible with atomic PTE updates #103
Conversation
While there is a race, it is a theoretical problem from the perspective of riscv-linux, as it needs to do an SFENCE.VMA on other threads after updating the page tables. Unless we have a reproducible failure, I'd rather we just keep this code as it is.

The second, unrelated change will prevent TLB misses from happening for subsequent stores to a page that is initially accessed with a load. We can only set W permission if the dirty bit is already set or if it is a store. I've discussed this with Richard Henderson.

Curious whether qemu x86 PTE accessed and dirty bit updates are atomic.

No, they are not. Linux would have the same issue here on x86, however it is theoretical, as one would need to craft code to exercise the race without an SFENCE.VMA (page table updates are done on one thread under the mm lock in Linux). The second change is not theoretical: it is likely frequent that we get a load to a page marked RW, and if we don't set the TLB to miss on the subsequent write, we'll miss updating the dirty bit. I'm not going to merge this PR in its current form. I'd accept an expansion on the current comment...
Or maybe x86_stl_phys_notdirty has some atomicity magic. I will look...
Nope. It appears target/i386 PTE updates in qemu are also racy. |
@michaeljclark I asked in #qemu a week ago and the reason x86 works is because x86 does not enable mttcg. Also I don't see how SFENCE.VMA protects you here. The race is:
Regarding the final change, I think you have this backward? I removed a case where PAGE_WRITE is set, so it causes more re-walks, not fewer.
This is what we have presently:
It seems quite clear to me. What am I missing? The old code worked because it only set the prot bits for the current access type, i.e. it was even more conservative.
I think the correct fix for the atomicity issue is to use a compare and swap. If the PTE has changed we can't update accessed or dirty, and probably need to return TRANSLATE_FAIL, as the PTE has changed underneath us while we were doing the walk.
That's fine in the context of the current code, but if you apply the first two hunks from my patch without the third it will result in lost dirty-bit updates, since MMU_DATA_STORE no longer implies that the page is now dirty.
I agree completely, but we don't currently have a casq_phys. If you'd prefer, I can try to get a casq_phys added to upstream qemu and then we can do this cleaner.
I'd rather do a compare and swap, and if it fails, return TRANSLATE_FAIL, because the PTE has changed underneath us, i.e. we got new information when trying to update the AD bits, which means the translation is invalid.
#103 (comment) and #103 (comment) seem to be exactly the same.
The particular logic and comment for setting PAGE_WRITE is IMHO much clearer than the way x86 does it (sets it, then later removes it). When one has a single clause, i.e. if PTE.W, only add PROT_WRITE if it is a store or if the page is already dirty, then it's much easier to understand as one compact statement, without needing to consider context that obfuscates the reasoning for the code.
I'm going to try and find a compare and swap intrinsic. We obviously have to return TRANSLATE_FAIL if we find that our PTE is no longer there. There are second-order issues. What if the PTE space is self-mapped: do we update dirty bits for the mapping of the PTE page itself? (probably not). What about AD bits for non-leaf PTEs?
PTE AD bit accesses are obviously physical, so one wouldn't expect pages that map them to have their dirty bit set, even when the page's underlying physical memory has been changed. Detecting changes to pages holding PTEs is only really an issue for someone who is implementing shadow page tables. As for AD bits on non-leaf PTEs, I'll have to read the spec... I wouldn't expect them to need to be set. The leaf PTEs, of whatever size, are the level at which the OS monitors pages.
It seems we can use |
Actually, it would be best if we just restart the walk if the AD bits compare-exchange fails.
Those (A) take virtual addresses, and we need something that takes physical addresses, and (B) longjmp when an I/O region is encountered, and are therefore unsafe to call outside TCG-generated code. Your requirement to use CAS is making this very complicated.
In any case, it's a theoretical issue until we get a reproducer. The prot flags for the dirty bit handling in the current form are correct. I can make a test case for dirty bit handling pretty easily, as I've written one that does most of this testing already. The easiest approach is to leave the code as it is. The correct approach is to make the PTE update atomic, and re-translate if the PTE is no longer valid. Sure, it won't work on IO space, but that is an unusual corner case.
If you can provide a reproducer for the issue in riscv-linux, and we see the change fixes the issue, then fine... until then it's theoretical; a simulation accuracy issue that would require specially crafted code to expose. If there is a genuine bug showing up in riscv-linux, I'd like to be able to reproduce it on my side, i.e. email me a test image.
I'll see about adding a reproducer to riscv-tests. Will come back when I have one; may be a few weeks.
Good. It might very well expose hardware bugs. It reminds me of this race (although it's not a hardware bug; it's documented behaviour that misaligned atomics only work within a cache line on x86): https://gist.github.com/michaeljclark/31fc67fe41d233a83e9ec8e3702398e8
Sagar, I redid the load of the PTE before the update to the dirty bits, but that is kind of futile, and the previous code didn't check whether it got a different result on the second load. My rationale is that if there is a race, adding several hundred picoseconds to it doesn't make much difference. It needs to be atomic.
If I was to do it in hardware I'd restart the walk, because one needs to use an AMO to update the PTE, but while one is walking, one is doing loads, not AMOs, until one finds a leaf, so there is a race in hardware too. In fact you need to do LR/SC to compare and swap. I guess it needs to take a reservation on the load of the leaf PTE, but given it doesn't know it has a leaf in advance, the page walker would have to take reservations on all page walk loads. Interesting. Usually uncovered hardware bugs become the documented behaviour. Also interesting to try this on x86.
Here's another test: what if you have a PTE in ROM that doesn't have the AD bits set? Does the bus error from setting the AD bits cause the translation to fail? Apple puts PTEs in ROM on Arm in iBoot for their secure enclave processor's early boot.
i.e. PTEs in ROM are not theoretical, on Arm at least. The spec says to set the AD bits to suppress updates. I really don't know how they make it atomic in hardware, i.e. a potential spec bug (over-constraint). Races between loads and stores in the page walker would likely be visible to other threads on x86, if one tried to invalidate the PTE while it was being walked. OS developers tend to use locks around those bits of code.
That's another bug: that code doesn't handle the case of misconfigured PTEs in ROM. What happens to stores to PTE AD bits in memory marked RO? That actually hit me in the rv8 page walker, as I had PTEs in ROM without the AD bits set. I don't know what the documented behaviour is.
I think we are going to have racy updates on i386. Maybe not (with some caveats). See #106. That said, the AD bits are in the low 32 bits of the PTE, so we can figure out the address and just update the low 32 bits of the PTE on a 32-bit host. It does mean we can only check 22 bits of the PPN (as there are 10 bits of flags). In any case, that would be a corner case of a corner case, i.e. we would have to have the combination of a relatively rare race, a 32-bit host, a physical address space bigger than 34 bits (22+12, or 16GiB), and a PTE update that differs by > 16GiB. I can add an #ifdef so that it also works on i386, with the caveat that we're only atomic for a 16GiB address space on 32-bit hosts.
It's a third-order corner case. In any case we will need to alter the code to work on the low word of the PTE for RV64 guests on 32-bit hosts, as it doesn't compile on i386.
We are not going to have racy updates for riscv64-on-i386, because mttcg is disabled then. Search the code for TCG_OVERSIZED_GUEST.
For RV64 on 32-bit, in addition to losing a relatively rare race, the PPN must differ by a multiple of 16GiB, i.e. the low 22 bits of the PPN and the flags need to match while the high-order bits differ. I think it's relatively safe to say that probabilistically this is very unlikely to occur. If we restrict the physical address space to 16GiB on 32-bit hosts, then it is impossible.
#105 is a better patch than I expected, and I'd rather not discuss it in two places.
Okay, that sounds better. I'll #ifdef the old non-atomic code path inside TCG_OVERSIZED_GUEST and try compiling it on i386...
Explanation in comments. It doesn't seem possible to test this with Linux (Linux does not create clean-but-writeable or old-but-valid PTEs), so we'll need to find another way to test it before merging.