CPU feature detection in core #3469
Conversation
I think it would be useful to have something like the following:

```rust
use core::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

macro_rules! is_x86_feature_detected {
    ("sse") => { feature_detect(X86Feature::SSE) };
    ("sse2") => { feature_detect(X86Feature::SSE2) };
    ("avx") => { feature_detect(X86Feature::AVX) };
    ("avx2") => { feature_detect(X86Feature::AVX2) };
    // ...
}

#[repr(u16)]
pub enum X86Feature {
    // Arbitrary order since I didn't bother to look it up;
    // we'd probably want to base these off `cpuid` bit positions
    // or something similar, since that makes `rust_cpu_feature_detect`
    // much simpler.
    SSE,
    SSE2,
    AVX,
    AVX2,
    // ... lots of elided features
    AVX512F, // assume this is the last one
}

impl X86Feature {
    pub const MAX: Self = Self::AVX512F; // assume this is the last one
}

#[derive(Copy, Clone, Eq, PartialEq, Debug, Default)]
pub struct X86Features(pub [usize; X86Features::ARRAY_SIZE]);

impl X86Features {
    pub const ARRAY_SIZE: usize =
        X86Feature::MAX as usize / usize::BITS as usize + 1;
}

extern "Rust" {
    // This should have some kind of weak linking or something.
    fn rust_cpu_feature_detect() -> X86Features;
}

#[inline]
pub fn feature_detect(feature: X86Feature) -> bool {
    const Z: AtomicUsize = AtomicUsize::new(0);
    static CACHE: [AtomicUsize; X86Features::ARRAY_SIZE] = [Z; X86Features::ARRAY_SIZE];
    static CACHE_VALID: AtomicBool = AtomicBool::new(false);

    #[cold]
    fn fill_cache() {
        let features = unsafe { rust_cpu_feature_detect() };
        for (cache, v) in CACHE.iter().zip(&features.0) {
            cache.store(*v, Ordering::Relaxed);
        }
        CACHE_VALID.store(true, Ordering::Release);
    }

    // Intentionally only use atomic store/load, to avoid needing cmpxchg
    // or similar for CPUs without support for that.
    if !CACHE_VALID.load(Ordering::Acquire) {
        fill_cache();
    }
    let index = feature as usize;
    let bit = index % usize::BITS as usize;
    let index = index / usize::BITS as usize;
    (CACHE[index].load(Ordering::Relaxed) >> bit) & 1 != 0
}
```
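The word/bit split at the end of `feature_detect` can be checked in isolation; a minimal sketch, where the helper name is made up and the feature indices are illustrative rather than real `cpuid` positions:

```rust
// Compute which cache word and which bit within that word hold a given
// feature flag, mirroring the index/bit arithmetic in `feature_detect`.
fn bit_position(feature_index: usize) -> (usize, u32) {
    let word = feature_index / usize::BITS as usize;
    let bit = (feature_index % usize::BITS as usize) as u32;
    (word, bit)
}

fn main() {
    // Feature 0 always lands in word 0, bit 0.
    assert_eq!(bit_position(0), (0, 0));
    // The feature numbered exactly `usize::BITS` rolls over into word 1, bit 0.
    assert_eq!(bit_position(usize::BITS as usize), (1, 0));
    println!("ok");
}
```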
That's approximately how
It's not what @Amanieu proposed; in particular, the code he described has the initialization driven by the threads calling it, and it also needs much less cache space for platforms without atomics.
Ah, well when you put it like that I see the difference :3
## Using a lang item to call back into `std`

Instead of having `std` "push" the CPU features to `core` at initialization time, an alternative design would be for `core` to "pull" this information from `std` by calling a lang item defined in `std`. The problem with this approach is that it doesn't provide a clear path for how this would be exposed to no-std programs which want to do their own feature detection.
A lang item with two "grades", weak and strong: `core` defines a weak version of the lang item so users are not required to provide their own lang item definition, but anyone can override it with a strong version of the same lang item (libstd will do that as well).
(A lang item can have a stable alias, like `#[panic_handler]`.)
"Weak lang items" is already used as the name for lang items like `#[panic_handler]` that need not be defined where they are used, but do need to be defined at link time if they were used anywhere. Maybe use "preemptible lang items" as the name here?
Co-authored-by: konsumlamm <44230978+konsumlamm@users.noreply.github.com>
This is what I described in alternative designs here. The main issues are:
This could also be solved by requiring that
We don't have to have all that API surface stabilized; we can just stabilize the existence of
We'll likely want an atomic check anyway, so we can be sure all the features are correctly synchronized with each other.
The problem with that is that users are often specifically trying to avoid static initializers due to ordering issues and stuff:
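The call-site-driven initialization being contrasted with static initializers can be sketched roughly like this; `detect_features` is a hypothetical stand-in for real `cpuid`/OS queries, and the bit pattern is made up:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

// Hypothetical detection routine standing in for real cpuid/OS queries.
fn detect_features() -> usize {
    0b1011 // made-up feature bits
}

static CACHE: AtomicUsize = AtomicUsize::new(0);
static VALID: AtomicBool = AtomicBool::new(false);

// Lazy, initializer-free caching: the first caller fills the cache.
// Racing threads may each run `detect_features`, but they all store the
// same value, so plain store/load (no cmpxchg) is enough: the Release
// store of VALID publishes the Relaxed store to CACHE to any thread
// that observes VALID == true via the Acquire load.
fn features() -> usize {
    if !VALID.load(Ordering::Acquire) {
        CACHE.store(detect_features(), Ordering::Relaxed);
        VALID.store(true, Ordering::Release);
    }
    CACHE.load(Ordering::Relaxed)
}

fn main() {
    assert_eq!(features(), 0b1011);
    assert_eq!(features(), 0b1011); // second call hits the cache
    println!("ok");
}
```

Nothing here runs before `main`, which is the point: programs that must avoid life-before-main pay the detection cost on first use instead.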
There's also the option of including all detection code in libcore, using raw system calls in assembly if OS interaction is needed. It makes libcore OS-specific, but I think that might be fine as long as it doesn't use any libraries.
Apart from Linux, most OSes don't allow direct syscalls. Go tried, but had to revert to using libc on most platforms after macOS changed the ABI of one syscall, breaking every Go program in existence, and OpenBSD added an exploit mitigation that makes your process crash if you try to do a syscall from outside of libc.
I guess on those systems libcore can just link the C library. After all, it seems that no useful program can exist on those systems without linking the C library (since without system calls, the only things a program can do are loop infinitely, waste a number of CPU cycles and then crash, or try to exploit the CPU or kernel), so it might as well be linked from libcore. There's also the issue that in some cases CPU feature detection needs to access the ELF auxiliary values. But argv and envp are passed to static initializers, and the ELF auxiliary entries are guaranteed to come after the environment pointers, which come after the argument pointers (at least on x86-64, but I'd guess all ABIs are like that), so there should be no need to call getauxval to access them.
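As an aside, on Linux the same auxiliary vector is also exposed as `/proc/self/auxv`, which makes the layout easy to inspect. A minimal sketch, assuming a 64-bit little-endian Linux target; the helper name is made up, though `AT_HWCAP = 16` matches `<elf.h>`:

```rust
use std::convert::TryInto;
use std::fs;

// AT_HWCAP is auxv entry type 16 on Linux (see <elf.h>).
const AT_HWCAP: u64 = 16;

// Scan a raw auxv image, a sequence of little-endian (type, value) u64
// pairs terminated by an AT_NULL (type 0) entry, for a given entry type.
fn find_auxv_entry(bytes: &[u8], wanted: u64) -> Option<u64> {
    for pair in bytes.chunks_exact(16) {
        let key = u64::from_le_bytes(pair[..8].try_into().unwrap());
        let val = u64::from_le_bytes(pair[8..].try_into().unwrap());
        if key == 0 {
            break; // AT_NULL terminates the vector
        }
        if key == wanted {
            return Some(val);
        }
    }
    None
}

fn main() {
    // /proc/self/auxv contains the same data the kernel placed above envp.
    match fs::read("/proc/self/auxv") {
        Ok(bytes) => match find_auxv_entry(&bytes, AT_HWCAP) {
            Some(hwcap) => println!("AT_HWCAP = {hwcap:#x}"),
            None => println!("AT_HWCAP not present"),
        },
        Err(_) => println!("auxv unavailable on this system"),
    }
}
```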
Might be worth thinking about something like AVX10, which uses point versioning like 10.1, 10.2, etc.
AVX10 will be more complicated than that because cores can have different capabilities. We currently have no way to represent those differences. We don't even have thread pinning in std, which would be a prerequisite to make that work.
I thought the whole idea of AVX10 was to give cores identical capabilities 🤔 |
Why are we doing this? What use cases does it support? What is the expected outcome?

This has 2 main benefits:
- It allows ``core`` and `alloc` to use CPU-specific features, e.g. for string processing which can make use of newer CPU instructions specifically designed for this.
Suggested change:
- It allows ``core`` and `alloc` to use CPU-specific features, e.g. for string processing which can make use of newer CPU instructions specifically designed for this.
+ It allows `core` and `alloc` to use CPU-specific features, e.g. for string processing which can make use of newer CPU instructions specifically designed for this.
https://cdrdv2.intel.com/v1/dl/getContent/784267, Introduction section on page 14 as of revision 2.0 of the document: the same instructions will be supported across all cores, but the maximum register width may vary across cores. Specifically, P-cores may support ZMM registers while E-cores may be limited to YMM. And in the CPUID table on page 15 I'm not seeing anything to query the minimum set across all cores instead of the current CPU...
Rendered

This RFC moves the `is_*_feature_detected` macros into `core`, but keeps the logic for actually doing feature detection (which requires OS support) in `std`.