
Add a hardware_destructive_interference_size const to the standard library #1756

Open
strega-nil opened this issue Sep 23, 2016 · 26 comments

Labels
T-libs-api Relevant to the library API team, which will review and decide on the RFC.

Comments

@strega-nil
strega-nil commented Sep 23, 2016

http://www.eelis.net/c++draft/hardware.interference

constexpr size_t hardware_destructive_interference_size = implementation-defined;

This number is the minimum recommended offset between two concurrently-accessed objects to avoid additional performance degradation due to contention introduced by the implementation. It shall be at least alignof(max_align_t).

struct keep_apart {
  alignas(hardware_destructive_interference_size) atomic<int> cat;
  alignas(hardware_destructive_interference_size) atomic<int> dog;
};

constexpr size_t hardware_constructive_interference_size = implementation-defined;

This number is the maximum recommended size of contiguous memory occupied by two objects accessed with temporal locality by concurrent threads. It shall be at least alignof(max_align_t).

struct together {
  atomic<int> dog;
  int puppy;
};
struct kennel {
  // Other data members...
  alignas(sizeof(together)) together pack;
  // Other data members...
};
static_assert(sizeof(together) <= hardware_constructive_interference_size);
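For comparison, a rough Rust translation of the destructive-interference example above. Since no such constant exists in Rust's standard library, this sketch hard-codes 64 bytes, a common x86_64 cache line size, not a guaranteed one:

```rust
use std::sync::atomic::AtomicI32;

// Hypothetical stand-in for hardware_destructive_interference_size;
// the literal 64 is an assumption, not a portable value.
pub const DESTRUCTIVE_INTERFERENCE_SIZE: usize = 64;

// Rust has no per-field alignas, so each field is wrapped in an
// aligned newtype to push `cat` and `dog` onto separate cache lines.
#[repr(align(64))]
pub struct Padded(pub AtomicI32);

pub struct KeepApart {
    pub cat: Padded,
    pub dog: Padded,
}
```

Note that the alignment must be written as a literal: `#[repr(align(N))]` does not accept a named constant, which is part of why a plain `const` in std would be awkward to use.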
@strega-nil
Author

It seems very useful.

@Amanieu
Member

Amanieu commented Sep 23, 2016

So, basically the cache line size? I've typically just used a hard-coded value of 64 for this.

@Amanieu
Member

Amanieu commented Sep 23, 2016

Also a const isn't very useful since you can't use it with #[align = "N"].

@strega-nil
Author

strega-nil commented Sep 23, 2016

@Amanieu Exactly. Hard-coding a value seems bad, since it differs across architectures.

The const limitation also seems bad.

@eddyb
Member

eddyb commented Sep 23, 2016

#[cfg(any(target_arch = "x86", ...))]
#[repr(align(64))]
pub struct CacheAligned<T>(pub T);

#[cfg(any(target_arch = "etc", ...))]
#[repr(align(123))]
pub struct CacheAligned<T>(pub T);
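A sketch of how the cfg-selected wrapper above might be used, spelling out one concrete 64-byte variant (the type name and value follow eddyb's sketch and are not an actual std API):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// One concrete (64-byte) variant of the sketch above; real code
// would select the alignment per target_arch via cfg.
#[repr(align(64))]
pub struct CacheAligned<T>(pub T);

// Per-thread counters padded so that increments from different
// threads land on different cache lines, avoiding false sharing.
pub struct Counters {
    pub slots: [CacheAligned<AtomicUsize>; 4],
}

impl Counters {
    pub fn new() -> Self {
        Counters {
            slots: std::array::from_fn(|_| CacheAligned(AtomicUsize::new(0))),
        }
    }

    pub fn bump(&self, thread_id: usize) {
        self.slots[thread_id % 4].0.fetch_add(1, Ordering::Relaxed);
    }
}
```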

@strega-nil
Author

@eddyb seems fine, if it's in the standard.

@nrc nrc added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Sep 28, 2016
@HadrienG2

Count me interested in this one. I wouldn't want to put lots of architecture-specific cfgs in my code just to badly reinvent a constant which LLVM probably natively knows about.

@eddyb
Member

eddyb commented Mar 10, 2017

@HadrienG2 How are we supposed to lift it from LLVM, though? All our type layouts and constants can currently be evaluated without LLVM's input.

@HadrienG2

This is a very good question, and my current answer is that I don't know. But I'm going to ask the LLVM team about it and get back to you once I have more details.

@hanna-kruppe

hanna-kruppe commented Mar 11, 2017

@HadrienG2 I saw your mail to llvm-dev and I think you misunderstood what @eddyb was saying. While rustc is currently using LLVM as its sole backend, it's going to support other backends in the future, and consequently should not be tied too much to LLVM for backend-agnostic issues. One part of this is that type layout computations and constants don't rely on LLVM, so that they can be used by all future backends. So even if LLVM exposed these constants, we probably wouldn't/couldn't reuse them (at least not by interacting with LLVM during type layout computation).

@HadrienG2

HadrienG2 commented Mar 11, 2017

If this is a concern, I think your best option would be to build a small abstraction layer which can plug into LLVM as well as any other backend you may use in the future, and use it to get this hardware-specific type layout parameter.

I guess my main concern here is that I would not want rustc to go through the trouble of building yet another database of hardware characteristics when there are already so many available in the wild, and you probably already have a good one right under your feet.

@hanna-kruppe

hanna-kruppe commented Mar 11, 2017

That doesn't solve the issue of having to duplicate that knowledge, it just pushes the responsibility for it onto N other backends. Besides, the specific backends planned (miri and cranelift) won't have this information available, so we'd just push the job of building/stealing such a database onto those projects (but why should they do that?). I understand your concern, but I don't really see any way to do better than rustc cribbing this knowledge from LLVM (assuming it's available over there, which hasn't been confirmed or denied on llvm-dev; maybe this stuff is solely in libc++, which rustc does not depend on).

@HadrienG2

HadrienG2 commented Mar 11, 2017

Well, the worst that could happen would be that you do decide to copy/paste this from LLVM, in which case my post on llvm-dev will still have achieved either of the following:

  • You get the orders of magnitude from the various posts made there by people like Bruce Hoult or David Abdurachmanov
  • You know which LLVM API/header you should keep in sync with, if such an API exists

From my perspective, that's still a win :)

@parched

parched commented Mar 11, 2017

IMO, as this can be target-dependent, we should add it as a field to the target spec. At the very least it can be CPU-dependent, which I guess LLVM might know if you specify -mcpu, but making it part of the target spec keeps it backend-independent.

@HadrienG2

HadrienG2 commented Mar 11, 2017

Summary of what has been said so far on llvm-dev:

  • PowerPC G5 (970) and all recent IBM Power have 128 byte cache lines.
  • Itanium might (?) also have 128 byte cache lines.
  • Intel has stuck with 64 recently with x86, at least at L1.
  • ARM, as usual, is hardware design anarchy. Can be 32, 64, even 128 on Cavium ThunderX. Can even vary within a single chip in heterogeneous designs like big.LITTLE.

There also is an API for this. Using TargetTransformInfo, you can call TTI->getCacheLineSize(). Not all targets provide this information, however, and it can be incomplete in some environments like the aforementioned architectures with heterogeneous cores.

@parched

parched commented Mar 11, 2017

Fwiw there is an architectural upper limit on arm/aarch64; from memory it's ~~1024~~ 2048 bytes. The implementation value can be queried at runtime though.

@joshlf
Contributor

joshlf commented May 4, 2017

Per @HadrienG2 's comment, is it even guaranteed to be constant for a particular architecture? That is, if I compile for, e.g., x86_64, am I guaranteed that, when I get there at runtime, the cache line size won't end up being 128 bytes on this particular CPU? Heck, it sounds from that comment like it can even be different depending on which cache we're talking about (L1, L2, etc).

Is there enough stability that we could at least have a compile-time constant that essentially means "this is the cache line size you're most likely to find at runtime" and then also provide a mechanism to query at runtime? Is that an insane idea? I feel like that might be just asking for people to misunderstand and think that the compile-time constant is a guarantee.
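The runtime-query half of the pairing joshlf describes can be sketched on Linux by reading the line size that sysfs exposes. This is a Linux-specific probe with an assumed fallback of 64 bytes, not a proposed std API:

```rust
use std::fs;

/// Best-effort runtime probe of the L1 data cache line size.
/// Falls back to 64 bytes when the information is unavailable
/// (non-Linux systems, missing sysfs entries, and so on).
fn cache_line_size() -> usize {
    fs::read_to_string("/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size")
        .ok()
        .and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|n| n.is_power_of_two())
        .unwrap_or(64)
}
```

A compile-time constant would then be the "most likely" value for the target, with this kind of probe available for code that truly needs the machine's actual value.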

@HadrienG2

HadrienG2 commented May 4, 2017

@joshlf : Obviously, there is a limit to the provisions that a static compilation model can provide against the stup... clever ideas of hardware manufacturers, and from time to time a recompilation will always be needed in order to take the characteristics of newer CPUs into account. I think it is fair to request it when a new cache line size is introduced for a given CPU architecture, as that should not happen too often.

When the cache line size varies from one hardware architecture implementation to another, if the purpose is to avoid false sharing, the compiler should report an upper bound on the possible cache line sizes, since padding too much is better than padding too little for this use case.

If the cache line size differs across the cache hierarchy, the destructive interference size should again be an upper bound taken across the entire cache hierarchy, since the invalidation of any cache hierarchy layer invalidates all layers of the hierarchy above it.

Of course, if something like gcc's "march" or "mtune" is specified, the compiler can choose a more optimal cache line size upper bound for the target CPU(s), if it knows about it.

Cache line size variability is probably the reason why the C++17 designers chose to draw a distinction between "constructive" and "destructive" interference sizes: the compiler may not want to report the same approximation of the cache line size depending on whether the developer's goal is to promote true sharing or to avoid false sharing. We should probably expose this distinction as well. Cache optimization is quite an expert topic anyway, so any fear of confusing people by exposing both an upper AND a lower bound of the cache line size is probably unfounded.

@le-jzr

le-jzr commented May 4, 2017

Two questions:

  • Is there any use case for exposing the static lower bound? The above example of failing the build when things do not fit on a cache line together seems useless.
  • Is there any use for the upper bound that can't be solved with CacheAligned as proposed by @eddyb?

@joshlf
Contributor

joshlf commented May 5, 2017

@HadrienG2 Given your comments, I think that taking the C++17 approach of having both an upper and lower bound exposed sounds like a good idea.

@le-jzr

> Is there any use case for exposing the static lower bound? Above example of failing build when things are not on a cache line together... seems useless.

What comment are you referring to?

> Is there any use for the upper bound that can't be solved with CacheAligned as proposed by @eddyb?

Maybe I'm misunderstanding something, but I was under the impression that the idea was that the upper bound would be a compile-time configuration value, as would the lower bound. So, for example:

#[cfg(cache_line_size_lower_bound = "64")]
#[repr(align(64))]
pub struct AlignedForSharing<T>(pub T);

#[cfg(cache_line_size_upper_bound = "64")]
#[repr(align(64))]
pub struct AlignedForNonSharing<T>(pub T);

I haven't thought through the math of what size you'd want things to be in both cases, so that's probably oversimplified, but it should illustrate the idea that both the upper and lower bounds are compile-time configuration constants that you can use to define the alignment you want.

@le-jzr

le-jzr commented May 6, 2017

You misunderstand me. I'm trying to figure out what issue is being solved here. To force non-sharing behavior, all you need is a CacheAligned<T> wrapper in std, the implementation being completely irrelevant. I cannot see any other use for the constants being proposed here.

@comex

comex commented May 6, 2017

@le-jzr Well, for any divisor X of the lower bound (including the lower bound itself), any struct with size <= X can ensure it's placed within a single cache line by aligning to X. Without alignment it might be split across two cache lines, despite being able to fit in one.
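comex's point can be checked arithmetically: if an object of size X is aligned to X, and X divides the line size, the object's first and last bytes land on the same line. A quick exhaustive check, assuming a 64-byte line:

```rust
// Assume a 64-byte cache line purely for the sake of the check.
const LINE: usize = 64;

/// Index of the cache line containing byte address `a`.
fn line_of(a: usize) -> usize {
    a / LINE
}

/// True if every `x`-aligned object of size `x` (over a 4 KiB range)
/// fits entirely on one cache line.
fn never_straddles(x: usize) -> bool {
    (0..4096)
        .step_by(x)
        .all(|addr| line_of(addr) == line_of(addr + x - 1))
}
```

For every power-of-two divisor of 64 the check passes, while a 48-byte object at a 48-byte-aligned address can span two lines, which is exactly the misalignment comex describes.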

@HadrienG2

HadrienG2 commented May 6, 2017

@le-jzr : For any use case of a cache line size upper bound which I would personally envision, the CacheAligned wrapper that you are discussing would be fine. I will let others comment on their own use cases. It is, in any case, quite easy to introduce CacheAligned now and later retrofit it to use the cache line size upper bound constant if that ends up being introduced in the end.

As for use cases for a cache line lower bound, besides the misalignment issue raised by @comex, there is also the question of how we could ensure that some data is kept on the same cache line if possible. This is particularly important as rustc is gaining more and more power to dramatically change the memory layout of objects.

For the lower/upper bound constant question, I would suggest that if we expose one, we expose the other for symmetry, like in the size_hint() iterator API. But we may well need to expose neither.

@le-jzr

le-jzr commented May 6, 2017

> @le-jzr Well, for any divisor X of the lower bound (including the lower bound itself), any struct with size <= X can ensure it's placed within a single cache line by aligning to X. Without alignment it might be split across two cache lines, despite being able to fit in one.

They are always powers of two, so you already know all the divisors. You don't actually need any platform knowledge to optimize for that - just align to the nearest bigger power of two, and you have completely platform-independent code that takes advantage of any cache. If you want to really optimize cache use, you can lay out data in a binary "buddy" pattern, and reap benefits on any platform without conditional compilation.

There's one tiny problem with just using the nearest bigger power of two, and that's when the resulting size is bigger than the upper bound. That would give no benefit on any processor, and make things more difficult for the allocator. But using the lower bound doesn't make sense. That would just make code that's slower than the unconditional layout, whenever the cache lines are bigger.
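The platform-independent rounding le-jzr describes is a one-liner in Rust. A sketch of padding a size up to the nearest power of two, capped by an assumed upper bound (the bound of 128 bytes is hypothetical):

```rust
/// Assumed upper bound on cache line size; hypothetical value.
const LINE_UPPER_BOUND: usize = 128;

/// Round `size` up to the next power of two; sizes already at or past
/// the upper bound are left as-is, since padding beyond the largest
/// possible cache line benefits no processor.
fn padded_size(size: usize) -> usize {
    size.next_power_of_two().min(LINE_UPPER_BOUND.max(size))
}
```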

@le-jzr

le-jzr commented May 6, 2017

To illustrate what I'm talking about, imagine a platform that can have no caches at all (lower bound = machine word width), but where a particular processor model can have large cache lines that your code would utilize given the chance. Taking the lower bound into account, you'd optimize your code for no or very small caches, losing all the benefits.

@HadrienG2

Ending back on this issue after a long while and some extra experience, I would recommend anyone who needs to avoid false sharing to use crossbeam_utils::CachePadded, contributing extra hardware-specific metadata to that crate as needed.

It's kind of sad that we need to duplicate LLVM's work here, but it's not too bad if the work remains centralized in crossbeam, and @Amanieu is right that the design of cfg attributes does not really leave us with any other choice.
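At its core, the crossbeam type is an aligned wrapper. A minimal std-only sketch of the same idea (crossbeam_utils::CachePadded additionally selects the alignment per target instead of hard-coding 64):

```rust
use std::ops::{Deref, DerefMut};

// Minimal stand-in for crossbeam_utils::CachePadded, hard-coding a
// 64-byte alignment instead of crossbeam's per-target selection.
#[repr(align(64))]
pub struct CachePadded<T>(T);

impl<T> CachePadded<T> {
    pub fn new(t: T) -> Self {
        CachePadded(t)
    }
}

impl<T> Deref for CachePadded<T> {
    type Target = T;
    fn deref(&self) -> &T {
        &self.0
    }
}

impl<T> DerefMut for CachePadded<T> {
    fn deref_mut(&mut self) -> &mut T {
        &mut self.0
    }
}
```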
