Add a hardware_destructive_interference_size const to the standard library #1756
It seems very useful.
So, basically the cache line size? I've typically just used a hard-coded value of 64 for this.
Also a …
@Amanieu Exactly. Hard-coding seems... bad? And since it's different across architectures, that also seems bad.
```rust
#[cfg(any(target_arch = "x86", ...))]
#[repr(align(64))]
pub struct CacheAligned<T>(pub T);

#[cfg(any(target_arch = "etc", ...))]
#[repr(align(123))]
pub struct CacheAligned<T>(pub T);
```
@eddyb seems fine, if it's in the standard.
Count me interested in this one. I wouldn't want to put lots of architecture-specific cfgs in my code just to badly reinvent some constant which LLVM probably natively knows about.
@HadrienG2 How are we supposed to lift it from LLVM, though? All our type layouts and constants can be evaluated without LLVM's input at the moment.
This is a very good question, and my current answer is that I don't know. But I'm going to ask the LLVM team about it and get back to you once I have more details.
@HadrienG2 I saw your mail to llvm-dev and I think you misunderstood what @eddyb was saying. While rustc is currently using LLVM as its sole backend, it's going to support other backends in the future, and consequently should not be tied too much to LLVM for backend-agnostic issues. One part of this is that type layout computations and constants don't rely on LLVM, so that they can be used by all future backends. So even if LLVM exposed these constants, we probably wouldn't/couldn't reuse them (at least not by interacting with LLVM during type layout computation).
If this is a concern, I think your best option would be to build a small abstraction layer which can plug into LLVM as well as any other backend you may use in the future, and use that to get this hardware-specific type layout parameter. I guess my main concern here is that I would not want rustc to go through the trouble of building yet another database of hardware characteristics when there are already so many available in the wild, and you probably already have a good one right under your feet.
That doesn't solve the issue of having to duplicate that knowledge, it just pushes the responsibility for it onto N other backends. Besides, the specific backends planned (miri and cranelift) won't have this information available, so we'd just push the job of building/stealing such a database onto those projects (but why should they do that?). I understand your concern, but I don't really see any way to do better than rustc cribbing this knowledge from LLVM (assuming it's available over there, which hasn't been confirmed or denied over at llvm-dev — maybe this stuff is solely in libc++, which rustc does not depend on).
Well, the worst that could happen would be that you do decide to copy/paste this from LLVM, in which case my post on llvm-dev will still have achieved either of the following:
From my perspective, that's still a win :)
IMO, as this can be target dependent, we should add it as a field to the target spec. At the least it can be CPU dependent, which I guess LLVM might know if you specify -mcpu, but making it part of the target spec makes it backend independent.
Summary of what has been said so far on llvm-dev:
There is also an API for this: using TargetTransformInfo, you can call TTI->getCacheLineSize(). Not all targets provide this information, however, and it can be incomplete in some environments, like the aforementioned architectures with heterogeneous cores.
Fwiw there is an architectural upper limit on arm/aarch64, |
Per @HadrienG2 's comment, is it even guaranteed to be constant for a particular architecture? That is, if I compile for, e.g., x86_64, am I guaranteed that, when I get there at runtime, the cache line size won't end up being 128 bytes on this particular CPU? Heck, it sounds from that comment like it can even be different depending on which cache we're talking about (L1, L2, etc). Is there enough stability that we could at least have a compile-time constant that essentially means "this is the cache line size you're most likely to find at runtime" and then also provide a mechanism to query at runtime? Is that an insane idea? I feel like that might be just asking for people to misunderstand and think that the compile-time constant is a guarantee.
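A sketch of what that two-part design could look like. The names are hypothetical, not real std items, and the runtime query is stubbed out with the compile-time hint as its fallback; a real implementation might consult CPUID on x86 or sysconf on Linux:

```rust
// Hypothetical API sketch, not a real std interface. The constant is a
// compile-time "most likely" value; the function is the runtime query.

/// Compile-time estimate of the cache line size for the target.
/// 64 bytes is a common value on current x86_64 and many ARM cores.
pub const CACHE_LINE_SIZE_HINT: usize = 64;

/// Runtime query. A real implementation might use CPUID on x86 or
/// sysconf(_SC_LEVEL1_DCACHE_LINESIZE) on Linux; this sketch just
/// falls back to the compile-time hint.
pub fn cache_line_size() -> usize {
    CACHE_LINE_SIZE_HINT
}

fn main() {
    let size = cache_line_size();
    // Cache line sizes are powers of two on mainstream hardware.
    assert!(size.is_power_of_two());
    println!("cache line size (best guess): {size}");
}
```

The misunderstanding risk raised above is exactly why the constant is named a "hint" in this sketch: it is an estimate, not a guarantee.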
@joshlf : Obviously, there is a limit to the provisions that a static compilation model can make against the stup... clever ideas of hardware manufacturers, and from time to time a recompilation will always be needed to take the characteristics of newer CPUs into account. I think it is fair to require one when a new cache line size is introduced for a given CPU architecture, as that should not happen too often.

When the cache line size varies from one implementation of a hardware architecture to another, and the purpose is to avoid false sharing, the compiler should report an upper bound on the possible cache line sizes, since padding too much is better than padding too little for this use case. If the cache line size differs across the cache hierarchy, the destructive interference size should again be an upper bound taken across the entire hierarchy, since invalidating any layer of the cache hierarchy invalidates all layers above it. Of course, if something like gcc's -march or -mtune is specified, the compiler can choose a tighter cache line size upper bound for the target CPU(s), if it knows about it.

Cache line size variability is probably the reason why the C++17 designers chose to draw a distinction between "constructive" and "destructive" interference sizes: the compiler may not want to report the same approximation of the cache line size depending on whether the developer's goal is to promote true sharing or to avoid false sharing. We should probably expose this distinction too. Cache optimization is quite an expert topic anyway, so any fear of confusing people by exposing both an upper AND a lower bound of the cache line size is probably unfounded.
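The constructive/destructive distinction could be mirrored in Rust roughly as follows. This is a sketch: the constant names and the 128/64 values are illustrative assumptions, not real std items:

```rust
use std::mem::{align_of, size_of};

// Illustrative values only: the upper bound is used to AVOID false
// sharing (pad generously), the lower bound to PROMOTE true sharing
// (keep hot data within one line even on small-line CPUs).
const DESTRUCTIVE_INTERFERENCE_SIZE: usize = 128; // upper bound over the hierarchy
const CONSTRUCTIVE_INTERFERENCE_SIZE: usize = 64; // assumed lower bound

// Padding to the upper bound. repr(align) requires a literal, so the
// value is repeated here rather than referencing the constant.
#[repr(align(128))]
struct PaddedCounter(u64);

fn main() {
    // Two PaddedCounters placed next to each other can never share a
    // cache line on any implementation covered by the upper bound.
    assert!(align_of::<PaddedCounter>() >= DESTRUCTIVE_INTERFERENCE_SIZE);
    // A pair of hot fields meant to be read together fits within the
    // lower bound, so they can share a line even on small-line CPUs.
    assert!(size_of::<(u32, u32)>() <= CONSTRUCTIVE_INTERFERENCE_SIZE);
}
```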
Two questions:
@HadrienG2 Given your comments, I think that taking the C++17 approach of having both an upper and lower bound exposed sounds like a good idea.
What comment are you referring to?
Maybe I'm misunderstanding something, but I was under the impression that the idea was that the upper bound would be a compile-time configuration value, as would the lower bound. So, for example:
I haven't thought through the math of what size you'd want things to be in both cases, so that's probably oversimplified, but it should illustrate the idea that both the upper and lower bounds are compile-time configuration constants that you can use to define what alignment you want.
You misunderstand me. I'm trying to figure out what issue is being solved here. To force nonsharing behavior, all you need is a CacheAligned wrapper.
@le-jzr Well, for any divisor X of the lower bound (including the lower bound itself), any struct with size <= X can ensure it's placed within a single cache line by aligning to X. Without alignment it might be split across two cache lines despite being able to fit in one.
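The straddling argument can be checked with a small sketch, assuming a 64-byte line (a common but not universal value):

```rust
// Sketch: a 48-byte struct aligned to 64 can never straddle a 64-byte
// line boundary, while at an arbitrary address it can.
const LINE: usize = 64; // assumed line size for this illustration

#[repr(align(64))]
struct Aligned([u8; 48]);

/// True if the byte range [addr, addr + size) spans a line boundary.
fn straddles(addr: usize, size: usize) -> bool {
    addr / LINE != (addr + size - 1) / LINE
}

fn main() {
    let a = Aligned([0; 48]);
    let addr = &a as *const Aligned as usize;
    // The aligned object starts at a multiple of 64, so its 48 data
    // bytes fall entirely within one line.
    assert!(!straddles(addr, 48));
    // At address 40, the same 48 bytes would span bytes 40..88,
    // crossing the boundary at 64.
    assert!(straddles(40, 48));
}
```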
@le-jzr : For any use case of a cache line size upper bound that I would personally envision, the CacheAligned wrapper you are discussing would be fine. I will let others comment on their own use cases. It is, in any case, quite easy to introduce CacheAligned now and later retrofit it to use the cache line size upper bound constant if that ends up being introduced. As for use cases for a cache line lower bound, besides the misalignment issue raised by @comex, there is also the question of how we could ensure that some data is kept on the same cache line if possible. This is particularly important as rustc is gaining more and more power to dramatically change the memory layout of objects. For the lower/upper bound question, I would suggest that if we expose one, we expose the other for symmetry, as in the size_hint() iterator API. But we may well need to expose neither.
They are always powers of two, so you already know all the divisors. You don't actually need any platform knowledge to optimize for that: just align to the nearest larger power of two, and you have completely platform-independent code that takes advantage of any cache. If you want to really optimize cache use, you can lay out data in a binary "buddy" pattern and reap the benefits on any platform without conditional compilation.

There's one tiny problem with just using the nearest larger power of two, and that's when the resulting size is bigger than the upper bound. That would give no benefit on any processor and make things more difficult for the allocator. But using the lower bound doesn't make sense: that would just produce code that's slower than the unconditional layout whenever the cache lines are bigger.
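The power-of-two rounding described above is a one-liner in Rust; this is a sketch of the layout rule, not a std API:

```rust
/// Round a struct's size up to the next power of two, so the object
/// never straddles a cache line on any machine whose line size is at
/// least that large.
fn pow2_slot(size: usize) -> usize {
    size.next_power_of_two()
}

fn main() {
    assert_eq!(pow2_slot(48), 64);  // a 48-byte struct gets a 64-byte slot
    assert_eq!(pow2_slot(64), 64);  // exact powers of two are unchanged
    assert_eq!(pow2_slot(65), 128); // just over a boundary doubles the slot
}
```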
To illustrate what I'm talking about, imagine a platform that can have no caches at all (lower bound = machine word width), but where a particular processor model can have large cache lines that your code would utilize given the chance. Taking the lower bound into account, you'd optimize your code for no or very small caches, losing all the benefits.
Coming back to this issue after a long while and some extra experience, I would recommend that anyone who needs to avoid false sharing use crossbeam_utils::CachePadded, contributing extra hardware-specific metadata to that crate as needed. It's kind of sad that we need to duplicate LLVM's work here, but it's not too bad as long as the work remains centralized in crossbeam, and @Amanieu is right that the design of cfg attributes does not really leave us any other choice.
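CachePadded is used by wrapping each independently-updated value. Here is a dependency-free sketch of the same idea; the real crossbeam_utils type selects the alignment per target_arch, and the 128 hard-coded below is purely an assumption for illustration:

```rust
// A minimal CachePadded-like wrapper. The real crossbeam_utils type
// picks its alignment per architecture; 128 is an assumption here.
#[repr(align(128))]
#[derive(Default)]
pub struct CachePadded<T>(pub T);

fn main() {
    let counters: [CachePadded<u64>; 2] = Default::default();
    let a = &counters[0] as *const CachePadded<u64> as usize;
    let b = &counters[1] as *const CachePadded<u64> as usize;
    // Each element occupies its own 128-byte slot, so two threads
    // bumping counters[0] and counters[1] never false-share a line.
    assert_eq!(b - a, 128);
}
```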
http://www.eelis.net/c++draft/hardware.interference
```cpp
constexpr size_t hardware_destructive_interference_size = implementation-defined;
```

This number is the minimum recommended offset between two concurrently-accessed objects to avoid additional performance degradation due to contention introduced by the implementation. It shall be at least alignof(max_align_t).

```cpp
constexpr size_t hardware_constructive_interference_size = implementation-defined;
```

This number is the maximum recommended size of contiguous memory occupied by two objects accessed with temporal locality by concurrent threads. It shall be at least alignof(max_align_t).