
Non-temporal stores (and _mm_stream operations in stdarch) break our memory model #114582

Open
RalfJung opened this issue Aug 7, 2023 · 61 comments
Labels
  • I-unsound: Issue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness
  • O-x86_32: Target: x86 processors, 32 bit (like i686-*)
  • O-x86_64: Target: x86-64 processors (like x86_64-*)
  • P-medium: Medium priority
  • T-lang: Relevant to the language team, which will review and decide on the PR/issue.
  • T-opsem: Relevant to the opsem team

Comments

@RalfJung (Member) commented Aug 7, 2023

I recently discovered this funny little intrinsic with the great comment

Emits a !nontemporal store according to LLVM (see their docs). Probably will never become stable.

Unfortunately, the comment is wrong: this has become stable, through vendor intrinsics like _mm_stream_ps.

Why is that a problem? Well, turns out non-temporal stores completely break our memory model. The following assertion can fail under the current compilation scheme used by LLVM:

use std::sync::atomic::{AtomicBool, Ordering::{Acquire, Release}};
use std::thread;

static mut DATA: usize = 0;
static INIT: AtomicBool = AtomicBool::new(false);

thread::spawn(|| {
  while INIT.load(Acquire) == false {}
  assert_eq!(unsafe { DATA }, 42); // can this ever fail? that would be bad
});

unsafe { nontemporal_store(&mut DATA, 42) };
INIT.store(true, Release);

The assertion can fail because the CPU may order MOVNT after later MOV (for different locations), so the nontemporal_store might occur after the release store. Sources for this claim:

  • Peter Cordes answer here: "A mutex unlock on x86 is sometimes a lock add, in which case that's a full fence for NT stores already. But if you can't rule out a mutex implementation using a simple mov store then you need at least sfence at some point after NT stores, before unlock."
  • glibc fixing their memcpy (which uses nontemporal stores) to have a trailing sfence.
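
The glibc fix referenced above amounts to a simple discipline: follow the non-temporal store(s) with one sfence before the publishing release store. A minimal sketch of that discipline (`publish` is an invented name; the non-x86 branch is only there to keep the sketch portable):

```rust
use std::sync::atomic::{AtomicBool, Ordering::Release};

// Sketch: the SFENCE after the non-temporal store restores the ordering
// that the Release store alone no longer guarantees for NT stores.
fn publish(data: &mut i32, flag: &AtomicBool, val: i32) {
    #[cfg(target_arch = "x86_64")]
    unsafe {
        use std::arch::x86_64::{_mm_sfence, _mm_stream_si32};
        _mm_stream_si32(data as *mut i32, val); // MOVNTI: weakly ordered NT store
        _mm_sfence(); // drains write-combining buffers before the publication
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        *data = val; // ordinary store: already ordered by the Release below
    }
    flag.store(true, Release);
}
```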

This is a big problem -- we have a memory model that says you can use release/acquire operations to synchronize any (including non-atomic) memory accesses, and we have memory accesses which are not properly synchronized by release/acquire operations.

So what could be done?

  • Remove nontemporal_store, implement the _mm_stream intrinsics without it, and mark them as deprecated to signal that they don't match the expected semantics of the underlying hardware operation. People should use inline assembly instead; it is then their responsibility to have an sfence at the end of their asm block to restore the expected synchronization behavior.
  • Change the way release stores are compiled such that an sfence is emitted. This is mostly a theoretical option, though: it is an ABI-breaking change; at least my understanding of the x86 ABI is that a regular mov is considered sufficient synchronization. To make this work reliably, all compilers for all languages that have something like release writes would need to emit the sfence, or else Rust code cannot be soundly linked with code produced by those other compilers.
  • There's a third hypothetical option of adjusting our concurrency memory model to be able to support these operations. But that's (a) a huge design space -- which fences are supposed to interact how with this? We might even have to add a new kind of fence! And (b) modifying the C++ concurrency memory model should be done with utmost care and thorough formal analysis; the current model is the result of a decade worth of research that shouldn't be thrown away lightly.
  • Just ignore the problem and hope it doesn't explode? I am not happy with this.

Thanks a lot to @workingjubilee and @the8472 for their help in figuring out the details of nontemporal stores.

Cc @rust-lang/lang @Amanieu

Also see the nomination comment here.

@RalfJung RalfJung added O-x86 O-x86_64 Target: x86-64 processors (like x86_64-*) T-lang Relevant to the language team, which will review and decide on the PR/issue. I-unsound Issue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness labels Aug 7, 2023
@rustbot rustbot added needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Aug 7, 2023
@RalfJung (Member, Author) commented Aug 7, 2023

I should have also Cc @rust-lang/opsem

@RalfJung RalfJung added the T-opsem Relevant to the opsem team label Aug 7, 2023
@the8472 (Member) commented Aug 7, 2023

I believe there are additional options:

C) have the backend lower the non-temporal store to either

  • a regular store
  • MOVNT + deferred sfence

whichever seems more beneficial. The sfence can be deferred up to the next atomic release (or stronger) and things that might contain one.

D) add a Safety requirement to the vendor intrinsic.
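
Option C's "deferred sfence" can be written out by hand today: a batch of NT stores shares a single fence, sunk to just before the next release operation. A hypothetical sketch (`fill_and_publish` is an invented name, not anything the backend emits):

```rust
use std::sync::atomic::{AtomicBool, Ordering::Release};

// Hand-written version of the "MOVNT + deferred sfence" lowering:
// no fence per store, one SFENCE for the whole batch, placed just
// before the Release store that publishes the data.
fn fill_and_publish(buf: &mut [i32], flag: &AtomicBool) {
    #[cfg(target_arch = "x86_64")]
    unsafe {
        use std::arch::x86_64::{_mm_sfence, _mm_stream_si32};
        for (i, slot) in buf.iter_mut().enumerate() {
            _mm_stream_si32(slot as *mut i32, i as i32); // no fence per store
        }
        _mm_sfence(); // deferred: one fence covers every NT store above
    }
    #[cfg(not(target_arch = "x86_64"))]
    for (i, slot) in buf.iter_mut().enumerate() {
        *slot = i as i32; // ordinary stores need no extra fence here
    }
    flag.store(true, Release);
}
```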

@Nilstrieb Nilstrieb removed the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Aug 7, 2023
@digama0 (Contributor) commented Aug 7, 2023

It's not clear to me that this "breaks the model" as opposed to "extends the model". Why can't we have nontemporal stores as an explicit part of the atomics model, and say that they are not necessarily synchronized by a release-acquire pair to another thread?

@RalfJung (Member, Author) commented Aug 7, 2023

have the backend lower the non-temporal store to either

Is "deferred sfence" a thing we can tell LLVM to do?

Also I think if we lower _mm_stream_ps in a way that includes an sfence, people will be very surprised. We should rather not expose _mm_stream_ps than expose something that's not the same operation.

add a Safety requirement to the vendor intrinsic.

No that doesn't work. We still need to explain how and why code has UB when it violates safety. Our current memory model has no way to make this code be UB.

This is an intrinsic, not a library function. It is specified by the operational semantics, not a bunch of axioms in the "safety" comment. The safety comment is merely reflecting the constraints that arise from the operational semantics. Making the safety comment the spec would mean making the spec axiomatic and that is something we certainly don't want to do.

Why can't we have nontemporal stores as an explicit part of the atomics model,

We can, but then we'll have to develop our own model. The C++ model does not have such stores.
I don't think we should have our own concurrency model, that well exceeds our capacity. Concurrency models are very hard to get right and being able to inherit all the formal studies of the C++ model is extremely valuable.

And process-wise, we certainly can't have a library feature like stdarch just change the memory model. That requires an RFC. It seems fairly clear that the impact of _mm_stream_ps on the whole language was not realized at the time of stabilization, and I think our only option is to remove this operation again -- or rather, get as close to removing it as we can within our stability guarantees: make it harmless, and deprecate it.

@the8472 (Member) commented Aug 7, 2023

Is "deferred sfence" a thing we can tell LLVM to do?

Not that I know. But it may be something llvm should be doing when a store is annotated with !nontemporal. Unless their memory model already knows how to specify nontemporal + release behavior (which may just be UB) because it's broader than the C++ model.

Also I think if we lower _mm_stream_ps in a way that includes an sfence, people will be very surprised. We should rather not expose _mm_stream_ps than expose something that's not the same operation.

I assume that as long as the sfence is sunk far enough, they might not care. Just like mixing AVX and SSE can result in VZEROUPPER being inserted by the backend.

This is an intrinsic, not a library function.

I see. That distinction isn't always obvious. But it makes sense if we want the compiler to optimize around the intrinsics more than around FFI calls.

We can, but then we'll have to develop our own model. The C++ model does not have such stores.

Can we say we use the C++ model with the modification that an axiom (all stores are ordered with release operations) is turned into a requirement (all stores must be ordered with release operations)? That seems minimally invasive.

@CAD97 (Contributor) commented Aug 7, 2023

This is an intrinsic, not a library function.

nontemporal_store is our intrinsic, but _mm_stream_ps is a vendor intrinsic. If nontemporal_store is to be directly exposed it certainly needs to participate in the operational semantics. But vendor intrinsics are somewhat special.

Vendor intrinsics are sort of halfway between standard Rust and inline asm. The semantics of the vendor intrinsic is whatever asm the vendor says it is, and it's the responsibility of the user of the intrinsic to understand what that means (or doesn't) on the Rust AM and use the intrinsics in a way consistent on the AM. Ideally, writing _mm_stream_ps(ptr, a) and writing asm!("MOVNTPS {ptr}, {a}") (with the correct flags which I haven't bothered figuring out) should be functionally identical, except that the former is potentially better understood by the compiler and doesn't have the monkeypatchable semantics of inline asm.

Is this complicated by the extra function boundaries imposed by us exposing vendor intrinsics as extern "Rust" fn instead of extern "vendor-intrinsic"? Certainly1, since _mm_stream_ps(ptr, a); _mm_sfence() and asm!("MOVNTPS {ptr}, {a}", "SFENCE") are different, but in a way I think easier to resolve than either fixing compilers to better respect weak/nt memory ordering on x86[_64] or auditing every vendor intrinsic for individually having consistent semantics on the Rust AM.

Slightly reinterpreting: it's perfectly acceptable, even expected for Miri to error when encountering vendor intrinsics. Using them steps outside the AM in a similar manner to inline asm, and as such you're relying on target specifics for coherency and sacrificing the ability to use AM-level sanitizers like Miri except on a best-effort basis.

Footnotes

  1. We should ideally consider if we can get away with "fixing" this such that vendor intrinsics can't be made into function pointers and aren't actually functions, to make their special status more evident. Though abusing the ABI marker for intrinsics is still a bit of a bodge, and this doesn't completely resolve the issues since asm!("MOVNTPS {ptr}, {a}"); asm!("SFENCE") still leaves the AM in an inconsistent state between the two asm blocks.

@RalfJung (Member, Author) commented Aug 7, 2023

I see. That distinction isn't always obvious. But it makes sense if we want the compiler to optimize around the intrinsics more than around FFI calls.

FFI calls don't help, FFI calls are only allowed to perform operations that could also be performed by Rust code (insofar as Rust-controlled state is affected). Getting the program into a state where release/acquire synchronization does not work any more is not allowed in any way, not even via FFI.

Like, imagine the following code:

use std::hint::unreachable_unchecked;
use std::sync::atomic::{AtomicBool, Ordering::{Acquire, Release}};
use std::thread;

static mut DATA: usize = 0;
static INIT: AtomicBool = AtomicBool::new(false);

thread::spawn(|| {
  while INIT.load(Acquire) == false {}
  let data = unsafe { DATA };
  if data != data { unsafe { unreachable_unchecked() }; }
});

some_function(unsafe { &mut DATA }, 42);
INIT.store(true, Release);

This code is obviously sound. The compiler is allowed to change this code such that the spawned thread becomes

  while INIT.load(Acquire) == false {}
  if DATA != DATA { unreachable_unchecked(); }

However, if some_function does a non-temporal store, this change can introduce UB! Now the non-temporal store might take effect between the two reads of DATA, and suddenly a value can seem unequal to itself.

Therefore, it is UB to leave an inline assembly block or FFI operation with any "pending non-temporal stores that haven't been guarded by a fence yet". The correctness of compilation relies on this assumption.
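
That rule is exactly what the inline-assembly escape hatch has to respect: the sfence must live in the same asm block as the NT store, so control never returns to the Rust AM with an unfenced pending NT store. A sketch under that assumption (x86-64 only; `nt_store_fenced` is an invented name):

```rust
#[cfg(target_arch = "x86_64")]
unsafe fn nt_store_fenced(ptr: *mut u64, val: u64) {
    use std::arch::asm;
    asm!(
        "movnti [{ptr}], {val}", // weakly ordered non-temporal store
        "sfence",                // fenced before control returns to Rust
        ptr = in(reg) ptr,
        val = in(reg) val,
        options(nostack, preserves_flags),
    );
}
```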

Can we say we use the C++ model with the modification that an axiom (all stores are ordered with release operations) is turned into a requirement (all stores must be ordered with release operations). That seems minimally invasive.

No, that's not how defining a memory model works. You'll need to define a new kind of memory access that's even weaker than non-atomic accesses and adjust the entire model to account for them.

The semantics of the vendor intrinsic is whatever asm the vendor says it is,

That doesn't work; they need to be phrased in terms of the Rust AM. Luckily, they are mostly about what happens to certain values of SIMD type, so the vendor semantics directly translate to Rust AM semantics.

But when it comes to synchronization, this approach falls flat. If _mm_stream_ps was implemented via inline assembly it would be UB.

auditing every vendor intrinsic for individually having consistent semantics on the Rust AM.

There's no way around that, but I hope we don't have many intrinsics that have global synchronization effects.

Slightly reinterpreting: it's perfectly acceptable, even expected for Miri to error when encountering vendor intrinsics. Using them steps outside the AM in a similar manner to inline asm, and as such you're relying on target specifics for coherency and sacrificing the ability to use AM-level sanitizers like Miri except on a best-effort basis.

Even vendor intrinsics are subject to the rule that governs all FFI and inline assembly: you can only do things to Rust-visible state that you could also have done in Rust. _mm_stream_ps violates that rule.

@talchas commented Aug 7, 2023

I hold that while nontemporal_store as intended is not covered by the memory model, (a) it is not a huge extension to add it to C/C++'s (though that's ugly to do for a single special case); (b) there's a reasonable way to model what users would actually use it for inside the existing model; and (c) as other people said, no user cares about the purity of the Rust AM when doing this, even if you do.

The potential extension to the current C++ intro.races (using C++ mainly because there's nicer html for it) is something like:

  • Nontemporal writes do not participate as A in any of the existing rules
  • A nontemporal write A happens before an evaluation B if
    • A is sequenced before B (ie nontemporal writes do participate in this one existing rule)
    • there is some nontemporal fence (sfence) X and evaluation Y such that A is sequenced before X, which is sequenced before Y, which simply happens before B (ie similar to the strongly happens-before rule)

Obviously this makes happens before even more nontransitive, which is ugly, and it's totally possible I have the details wrong the same way every C/C++ spec has done, but it's really not implausible to do.

Alternatively within the existing rules you can completely ignore hardware and say that they act like this (pretending nontemporal_store is monomorphic on u8):

// blah blah boilerplate
#[derive(Clone, Copy)]
struct SendMe<T>(T);
unsafe impl<T> Send for SendMe<T> {}
unsafe impl<T> Sync for SendMe<T> {}

#[thread_local]
static mut BUFFER: Option<SendMe<&'static Mutex<Vec<(*mut u8, u8)>>>> = None;

pub unsafe fn nontemporal_store(ptr: *mut u8, val: u8) {
    let buf = if let Some(b) = BUFFER {
        b
    } else {
        let b = SendMe(&*Box::leak(Box::new(Mutex::new(Vec::new()))));
        BUFFER = Some(b);
        b
    };
    std::thread::spawn(move || {
        // `rand` is an external crate; any source of delay jitter works here
        std::thread::sleep(std::time::Duration::from_millis(u64::from(rand::random::<u8>())));
        let buf = buf;
        do_sfence(buf.0); // pass the &'static Mutex itself, not a reference to it
    });
    buf.0.lock().unwrap().push((ptr, val));
}

pub fn sfence() {
    unsafe {
        if let Some(b) = BUFFER {
            do_sfence(b.0)
        }
    }
}

fn do_sfence(buffer: &'static Mutex<Vec<(*mut u8, u8)>>) {
    let mut buffer = buffer.lock().unwrap();
    for (ptr, val) in buffer.drain(..) {
        unsafe { *ptr = val; }
    }
}

i.e., you do vaguely what the CPU will actually do: buffer all nontemporal stores until some indeterminate time later, or until you perform a fencing operation.

This does not permit nontemporal_store(ptr, val); *ptr, which should be permitted, but it does show that modelling nontemporal_store as an opaque FFI operation is sufficient for the compiler. And the actual users can consider nontemporal_store to be either this or just *ptr = val; depending on whether or not they are sure they have the correct fencing, which the actual compiler cannot ruin, because it's opaque. (And miri can either grow something complicated or not detect errors in this use, or just not support this operation at all; any of these are fine)

@CAD97 (Contributor) commented Aug 7, 2023

The semantics of the vendor intrinsic is whatever asm the vendor says it is,

That doesn't work, they need to be phrased in terms of the [language] AM. [abstraction ed.]

While this may be true for intrinsics to be coherently usable from [language], it's not quite true in practice; the Intel documentation says that _mm_stream_ps is the intrinsic equivalent of the MOVNTPS instruction. Its definition is "does this assembly mnemonic operation" and no more, though the compiler is in fairness expected to understand what that means and do what's necessary to make it coherent in the most efficient way possible.

I think my position can mostly be summed up as that if using the vendor intrinsic is (available and) immediate UB in GCC C, GCC C++, LLVM C, LLVM C++, and/or MSVC C++1 (i.e. due to breaking the atomic memory model), it's "fine" if using the intrinsic is immediate UB in rustc Rust. The behavior of the vendor intrinsic is the same as it is in C: an underspecified and incoherent extension to the memory model that might or might not work (i.e. UB).

Though given that the intrinsic should presumably work as intended on Intel's compiler, and I think the Intel oneAPI C++ Compiler is LLVM-based, it's presumably handled "good enough" in LLVM for Intel's purposes.

I'd be fine with marking vendor intrinsics as "morally questionable" (for lack of a better descriptor) and potentially even deprecating them if the backend can't handle them coherently2, but I wouldn't want us to neuter them to use different operations. It's assumed at a level equivalent to inline asm that if a developer is using vendor intrinsics that they know what they're doing. It's "just" (giant scare quotes) that knowing what you're doing with _mm_stream_ps and friends probably requires using an Intel-certified compiler with the whatever-Intel-says atomic model rather than the standard C++20 atomic model that most of computation relies on.

My justification for not neutering the implementation is roughly this: mem::uninitialized is "our" intrinsic, and we failed to put sufficient requirements on how we told people to use it, so neutering it (i.e. initializing to fill with 0x00 or 0x01) to get it closer to being sound where we previously said it was is the correct move. But _mm_stream_ps and other vendor intrinsics have always been documented as "does what the vendor says, figure it out," so it's not "our fault" if "what the vendor says" is incoherent like it is with mem::uninitialized.


Especially if this family of vendor intrinsics is UB in LLVM C++, this feels like a question that needs to bubble up to Intel's compiler team on how they think the intrinsic should be handled. Because this is a vendor intrinsic, it should behave however the vendor says it should, but we'd absolutely be justified to put a warning on it if it's fundamentally broken. But we shouldn't change how it behaves without input from the vendor.

I see maybe 2½+1 resolutions:

  • The compiler is made properly aware of NT semantics on x86[_64], and strengthens either the NT operation (s/MOVNT(.*)/MOV\1/) or the first following memory_order_release operation (SFENCE or LOCK) to ensure proper atomic memory order is preserved.
  • Intel says SFENCE should not be automatically inserted for NT intrinsics; we thus mark NT intrinsics as broken but leave the implementation unchanged.
  • Intel agrees NT intrinsics are incoherent and retracts them; we thus "remove" NT intrinsics (deprecate and neuter à la mem::uninitialized).
  • Highly unlikely: the C++20 atomic model is coherently extended with a memory_order_nontemporal, matching the semantics of NT operations, and Rust adopts it (unlike memory_order_consume which Rust doesn't expose3).

but firmly choosing one without vendor input seems improper.


I find it interesting that NT would be "easier" for Rust if it weren't same-thread consistent, because then it could probably be modeled as an access from an anonymous thread (i.e. exempted from the sequenced-before relation), as modeled by talchas. But it is, so this observation is mostly irrelevant. I make no attempt to say whether the relaxation of the model is accurate, nor whether it breaks the existing proof body built on the current model4. Though I do think you're at least missing a buffer.clear() from your do_sfence (or should be iterating buffer.drain()).

Footnotes

  1. MSVC C only sort of exists, thus my omission of it. Also, I think Microsoft may only provide _mm_stream_ss and not _mm_stream_ps? The MSVC docs for the former link to the Intel docs for the latter (and _mm_sfence).

  2. Ideally marked in such a way that's backend-specific, such that use of broken vendor intrinsics can warn on the backends which don't handle it fully, but not on the vendor backend that does, if such a thing ever comes into existence. Though with MIR semantics not respecting NT operations, the vendor would also need to validate any middleend transforms as well as their backend.

  3. Unrelated tangential query: is the Rust AM memory model "C++20" or "C++20 without memory_order_consume"? IOW, is FFI using memory_order_consume in a way visible to the Rust AM defined or UB? I very much do not know the full story of memory_order_consume and honestly don't particularly care to know much more — knowing how OOTA is permitted to break causality is cursed enough for me — but if memory_order_consume means something in the atomic model which Rust inherits from C++ and uses (i.e. it isn't just aliased to acquire), it should probably be possible from Rust (even if #[doc(hidden)] #[deprecated]) to communicate that reality.

  4. And this risk of breaking the rest of the model is the problematic risk of extending the model. Especially if Rust requires an adhoc nontemporal extension without the C++ model that everyone else is assuming adopting the same extension. You need to prove not just that your extension is sufficient for a nontemporal memory order but that its presence also doesn't impact the rest of the model, such that existing proofs that ignore nontemporal remain accurate so long as the relevant memory locations are not accessed nontemporally.

@talchas commented Aug 7, 2023

Oh yes, I originally wrote that as consuming the buffer and forgot to shift it to drain. (fixed)

@CAD97 (Contributor) commented Aug 8, 2023

Since I got somewhat nerd-sniped:

A spitball extension to the cppreference memory ordering description.

Ultimately I think this came out in similar effect to talchas's two rules, but I worked it out a bit further.

I believe the most straightforward way to extend the C++ atomic model to support nontemporal writes would be to split off a concept of weakly sequenced-before for nontemporal writes in the causal chain, like how happens-before is split to handle consume causality.

The accuracy of my extension is predicated on my understanding of the happens-before relation qualifiers being accurate.

Namely, that inter-thread happens-before is happens-before which includes crossing a thread boundary, simply happens-before excludes consume operations (the dependency-ordered-before relation), and strongly happens-before only matters for establishing the global seq_cst ordering $S$.

As such I'm not particularly interested in sussing out the exact difference of strongly happens-before and simply happens-before, since as $A$ sequenced-before $X$ simply happens-before $Y$ sequenced-before $B$ $\implies$ $A$ strongly happens-before $B$, the difference between the relations is IIUC very small — only when $A$ or $B$ is exactly the operation on one side of an acq/rel synchronizes-with in the causal chain, since strongly happens-before restricts a direct synchronizes-with contribution to those using seq_cst operations — and I simply choose not to write the kind of multiple atomic location using code where seq_cst could even potentially be needed.

Though I am part of the heathen group which would prefer acq_rel be permitted for non-rmw operations, with the semantics of choosing acq/rel as appropriate based on if it's a load/store. Maybe that lessens the validity of my opinions somewhat, but I legitimately believe that a significant part of the reason that "just use seq_cst" is a thing is not because seq_cst is stronger than acq_rel (I find that the added guarantees actually make the behavior harder to reason about, and it doesn't even make an interleaving based understanding of many thread interactions any less wrong) but rather more because you can just use seq_cst everywhere, instead of needing to choose between acq, rel, or acq_rel based on if the operation is a load, store, or rmw. It would make confusion around (the non-existence of) "release loads" and "acquire stores" worse, but I personally think the benefit of being able to use acq_rel everywhere when consistently using acq_rel syncronization outweighs that, since in that paradigm the use of an aquire or release rmw operation being part-relaxed is more immediately evident.


We define $A$ weakly sequenced-before $B$ as defined by the language evaluation order. $A$ sequenced-before $B$ now requires both $A$ weakly sequenced-before $B$ and that $A$ is not a nontemporal write.

Downstream sequenced-before requirements are adjusted as:

  • carries a dependency into still uses sequenced-before, excluding nontemporal writes. (consume ordering is rather cursed; notes after.)
  • inter-thread happens-before still uses sequenced-before, excluding nontemporal writes.
  • happens-before is relaxed to require weakly sequenced-before. NB: happens-before is not transitive, and relies on the separate transitivity of its two sufficient relations, weakly sequenced-before and inter-thread happens-before.
  • simply happens-before and strongly happens-before still use sequenced-before, excluding nontemporal writes.

(This is the dangerous part. The notes about consume ordering w.r.t. simply happens-before and strongly happens-before w.r.t. consume should now say consume and nontemporal. That nontemporal ends up handled similarly to how consume is (i.e. the split between transitive inter-thread happens-before and non-transitive happens-before) I think makes the problems fairly clear, given the known problems with consume ordering. But if consume is formally coherent in the current definition, even if no compiler actually treats it as weaker than acq_rel, then nontemporal can be coherent with this (or a similar) formalization, even if no compiler will actually treat it as weaker than a standard write.)

Why does carries a dependency into exclude nontemporal writes? It only matters with release-consume synchronization, but if a chain of weakly sequenced-before $\implies$ dependency ordered-before $\implies$ inter-thread happens-before $\implies$ happens-before can stand, we have a way to require nontemporal writes to be visible cross-thread without guaranteeing an sfence (bad) and that release-consume synchronization is stronger than release-acquire synchronization (even worse). So nontemporal writes simply don't participate in dependency ordering, which in retrospect clearly sounds like the correct result.

This doesn't cover fences, but neither does the cppreference memory ordering description. If my understanding of fences is accurate — that an atomic fence (enters $S$ if a seq_cst fence and) upgrades any atomic loads (any ordering) sequenced-before it to acquire (if an acquire fence) and any atomic stores (any ordering) that it is sequenced-before to release (if a release fence), except that the synchronization occurs with respect to the sequence point of the fence instead of of the fenced operation(s) — then a sequence fence can be handled with the addition of one more rule: that $A$ weakly sequenced-before sequential fence $F_B$ $\implies$ $A$ sequenced-before $F_B$ ($\therefore$ $A$ now participates in inter-thread happens-before).

Phrasing other fences in these terms:

Given atomic store (any order) $X$ and atomic load (any order) $Y$, I will use " $X$ would synchronize-with $Y$" as shorthand for " $Y$ reads the value written by $X$ (or by the release sequence headed by $X$ were $X$ a release operation)." Equivalently, $X$ would synchronize-with $Y$ if-and-only-if were $X$ a release store and $Y$ an acquire load then $X$ would synchronize-with $Y$. (Though fence synchronizes-with might be a better name, or perhaps weakly synchronizes-with in analogy with how we handled sequential fences... or maybe rename weakly sequenced-before to fence sequenced-before?)

  • release fence $F_A$ sequenced-before atomic store (any order) $X$ and $X$ would synchronize-with acquire load $B$ $\implies$ $F_A$ synchronizes-with $B$
    • $\therefore$ $F_A$ happens-before $B$.
  • release store $A$ would synchronize-with atomic load (any order) $Y$ and $Y$ sequenced-before acquire fence $F_B$ $\implies$ $A$ synchronizes-with $F_B$
    • $\therefore$ $A$ happens-before $F_B$.
  • release fence $F_A$ sequenced-before atomic store (any order) $X$ and atomic load $Y$ sequenced-before acquire fence $F_B$ and $X$ would synchronize-with $Y$ $\implies$ $F_A$ synchronizes-with $F_B$
    • $\therefore$ $F_A$ happens-before $F_B$.

(In the above, and binds more tightly than $\implies$, thus [ $I$ and $J$ $\implies$ $K$] parses as [($I$ and $J$) together imply $K$].)


NB: since atomic operations (including fences) are not nontemporal writes, the weakly sequenced-before and sequenced-before relations between them are definitionally identical.


A primary benefit of adding nontemporal operations in this manner is that it should be trivially true that if no nontemporal operations are involved, nothing changes. This means that the extension cannot make the situation any worse for people relying on the memory order guarantees than it currently is; at worst nontemporal operations go from immediate UB to broken in a different manner. It could still of course muck things up for people in charge of maintaining the memory order (compilers), but I believe talchas's surface language fn nontemporal_store shows that this doesn't restrict reasoning around fully unknown (potentially thread spawning) functions.

The only terms which are impacted are happens-before, visible (side effect), visible sequence of side-effects, and the newly added weakly sequenced-before.

The chance of me ever actually proposing this extension anywhere is effectively 0. But I do, to the full extent legally possible, waive my copyright for the contents of this post/comment, releasing it into the public domain. If you want to do something with this wording, feel free.


However, even if the above would potentially work as a definition, what it definitely does not show is whether nontemporal stores are currently nonsense (given compilers' models) and "miscompiled" (given this model) by existing compiler transforms. I think it's probably fine — a code transformation relying on weakly sequenced-before (nontemporal) being sequenced-before would necessarily require inter-thread reasoning, and I don't think any existing compiler transforms do such, since "no sane compiler would optimize atomics" — but that's an extremely strong assertion to make.


Separately, and actually somewhat relevant — Rust might still not be able to simply use the C++20 memory model completely as-is if we permit mixed-size atomic operations. Though of course, even if the formal C++ description leaves such accesses as a UB data race, the extension to make them at least not race is fairly trivial, and IIRC no one expects them to actually synchronize.

Though I do think people would expect mixed-size atomic operations to be coherent w.r.t. modification order, and those rules talk about some "atomic object $M$" and not a "memory location," and I'm no longer sure the necessary modifications are in any way simple.

@programmerjake
Member

because LLVM currently compiles all but sequentially-consistent fences to no-ops on x86, I think it currently miscompiles non-temporal stores if they're anything remotely like normal stores, because you can have acquire/release fences as much as you please and LLVM will happily compile them to nothing. imho the extra guarantees provided by sequential consistency shouldn't be required to make non-temporal stores behave.

https://rust.godbolt.org/z/1dhEKM5oa
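The lowering in question can be reproduced in a few lines (a sketch; the block itself can only show that all four fences are legal Rust — the codegen claim has to be checked in the disassembly, e.g. via the link above):

```rust
use std::sync::atomic::{fence, Ordering};

// On x86-64, LLVM lowers the first three fences to no instructions at all:
// they only restrain compiler reordering and lean on the hardware's TSO
// guarantees -- which non-temporal stores deliberately step outside of.
pub fn all_fences() -> u32 {
    fence(Ordering::Acquire); // no instruction emitted on x86-64
    fence(Ordering::Release); // no instruction emitted on x86-64
    fence(Ordering::AcqRel);  // no instruction emitted on x86-64
    fence(Ordering::SeqCst);  // `mfence` (or an equivalent locked RMW)
    4 // number of fences issued
}

fn main() {
    println!("{}", all_fences());
}
```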

@programmerjake
Member

there's also non-temporal loads which seem to also be just as much fun!
https://www.felixcloutier.com/x86/movntdqa

@workingjubilee
Contributor

Non-temporal loads, in practice, are effectively normal loads, because their optimization is so conditional it can rarely trigger according to the letter of the ISA (it requires close reading of the fine print to tease this out). Because it is such a theoretical optimization, my understanding is it is largely unimplemented, and the one vestige is that the prefetchnta instruction is sometimes supported.

@talchas

talchas commented Aug 8, 2023

Yes, you need the instructions that happen to be generated by sequential consistency fences, or explicit calls to arch intrinsics/asm like sfence. If you wanted nontemporal stores to behave like normal stores in the language model, you wouldn't need a fence at all; requiring a release fence and pessimizing its codegen for other cases would be a weird worst of all worlds imo. (Requiring a seqcst fence that is in practice always going to generate an acceptable instruction or requiring an explicit sfence both seem fine to me)

And yeah, for loads note the "if the memory source is WC (write combining) memory type" in the instruction description - WC is not the normal memory type, it's the weak memory type and if your rust code built for x86_64-any-target-at-all has access to any memory of that type it's already broken wrt synchronization. (NT stores just treat any memory like it's WC)

@programmerjake
Member

And yeah, for loads note the "if the memory source is WC (write combining) memory type" in the instruction description - WC is not the normal memory type, it's the weak memory type and if your rust code built for x86_64-any-target-at-all has access to any memory of that type it's already broken wrt synchronization. (NT stores just treat any memory like it's WC)

well, we still need to be able to have Rust properly handle it because WC memory is often returned by 3D graphics APIs (e.g. Vulkan) for memory-mapped video memory.

@talchas

talchas commented Aug 8, 2023

Well uh that'll be fun if you wanted to program to spec, because any store there is basically a nontemporal store and isn't flagged as such in any way to the compiler.

It'll work of course so long as you either sfence manually in the right places or never try to have another thread do the commit (which presumably will do the right sync, but needs to happen on the cpu that did the WC write), since any memory given to you like that will be known by the compiler to be exposed, and asm sfence/etc would be marked as touching all exposed memory. (Don't make an &mut of it though probably? Who knows what llvm thinks the rules around noalias + asm-memory are)

@RalfJung
Member Author

RalfJung commented Aug 8, 2023

I hold that while nontemporal_store as intended is not covered by the memory model, it is a) not a huge extension to add it to C/C++'s

It requires an entire new access class, so it is a bigger change than anything that has happened to the model since 2011. It also requires splitting "synchronizes-with" into "synchronization through seqcst fences" (which does synchronize nontemporal accesses) and everything else (which does not synchronize nontemporal accesses). So this is a huge change to the model, and all consistency theorems (like DRF), compiler transformations, lowering schemes etc. will have to be carefully reconsidered. We currently have proofs showing that the x86 instructions compilers use to implement memory operations give the right semantics. It would be a mistake to lower our standards below this; mistakes have been made in the past.

It is just false to claim that this is a simple extension.

Any attempt to do this needs to start with something like https://plv.mpi-sws.org/scfix/paper.pdf, the formulation of the model in the standard is just too vague to be reliable. (According to the authors of that paper, the standard doesn't even reflect the SC fix properly even though that was the intention of the standard authors.)

The C++ memory model needed many rounds of iteration with formal memory model researchers to get into its current state. It still has one major issue (out-of-thin-air), but the original version (before Mark Batty's thesis) was a lot worse and even after that some things still had to be fixed (like SCfix). Anyone who claims they can just quickly modify this model and be sure not to reintroduce any of these problems is vastly underestimating the complexity of weak memory models.


@CAD97

Its definition is "does this assembly mnemonic operation" and no more, though the compiler is in fairness expected to understand what that means and do what's necessary to make that coherent in the most efficient way possible.

Again that just doesn't work. We have a very clear rule for inline assembly and FFI: they can only do things that Rust code could also have done, insofar as Rust-visible state is affected. It is completely and utterly meaningless to take an operation from one semantics and just take it into another, completely different semantics.

@RalfJung
Member Author

RalfJung commented Aug 8, 2023

because LLVM currently compiles all but sequentially-consistent fences to no-ops on x86, I think it currently miscompiles non-temporal stores if they're anything remotely like normal stores, because you can have acquire/release fences as much as you please and LLVM will happily compile them to nothing. imho the extra guarantees provided by sequential consistency shouldn't be required to make non-temporal stores behave.

Well they don't say much about how nontemporal stores are supposed to behave, so it's unclear if they are miscompiling or just implementing a surprising (and unwritten) spec. I opened llvm/llvm-project#64521 to find out.

@CAD97
Contributor

CAD97 commented Aug 8, 2023

We have a very clear rule for inline assembly and FFI: they can only do things that Rust code could also have done, insofar as Rust-visible state is affected. It is completely and utterly meaningless to take an operation from one semantics and just take it into another, completely different semantics.

To be clear, the point I'm making is actually that if the definition as "does this assembly sequence" has busted and unusable semantics on the AM, then the vendor intrinsic is busted and unusable, which is IIUC in agreement with you.

The main point where I disagree is that, given Intel has defined a busted and unusable intrinsic, the correct behavior is for us to expose the intrinsic as-is, busted and unusable though it may be, potentially with an editorial note saying as much.

basically saying the same thing again twice over but I didn't want to delete it

We don't specify the behavior of vendor intrinsics, the vendor does, and if the vendor specified an unusable intrinsic, that's on them to fix, not us. Changing the implementation of the intrinsic to be busted but still usable would be exposing an incorrect intrinsic.

I fully agree that our current semantics don't support the intrinsic, though I do wish it was possible to temporarily suspend the AM's knowledge of a memory location in regions that a spurious or reordered load can't reach, in order to make it, well, not supported, but marginally less unsupported. But, and despite understanding the implications of UB, I do believe "it's always UB (but might accidentally happen to do what you want sometimes)" (the current situation) is preferable to not doing what the intrinsic says on the tin because what it says is "causes UB" (but obfuscated).

(Though the intrinsic might actually be soundly usable on a pointer which is never dereferenced, retagged, nor otherwise accessed from Rust code, such that the Rust AM is never aware of the AM-invalid allocated object even existing. But such isn't anywhere near practical so is mostly irrelevant even if potentially sound.)


on a memory model extension

[NT store] also requires splitting "synchronizes-with" into "synchronization through seqcst fences" (which does synchronize nontemporal accesses) and everything else (which does not synchronize nontemporal accesses).

While this is true of the Intel semantics, I would personally expect the memory model to not include sfence effects into a seqcst fence (my spitball formalization deliberately doesn't) and require users to either write both or to rely on the target architecture having a strong memory ordering plus "no sane compiler would optimize atomics."

However, further discussion on extending the model to support NT stores (as opposed to documenting/mitigating for what we have currently) should probably stick to Zulip. Any such movement is at best years out. I only included this point here since this post has other (maybe) useful content.


@RalfJung
Member Author

RalfJung commented Aug 8, 2023

(I have edited the OP to explicitly state that adjusting the memory model is an option in principle, but not one I consider realistic.)

The main point where I disagree is that, given Intel has defined a busted and unusable intrinsic, the correct behavior is for us to expose the intrinsic as-is, busted and unusable though it may be, potentially with an editorial note saying as much.

I don't see why we would expose a busted intrinsic. We already have inline assembly, and these days it is stable (it was not when stdarch got stabilized), so I think we are totally free to tell Intel that we won't expose operations that don't compose with the rest of our language -- for reasons that are entirely on them, since they told the world that their hardware is TSO(-like) and then also added operations which subvert TSO.

If these intrinsics weren't stable yet, is anyone seriously suggesting we would have stabilized them, knowing what we do now? I would be shocked if that were the case. The intrinsic even says in its doc comment that this will likely never be stabilized! Sadly whoever implemented _mm_stream_ps didn't heed that comment. So to me this is a clear oversight and the question is how we best do damage control.

@RalfJung
Member Author

RalfJung commented Aug 8, 2023

I wonder if there is some way that we can argue that

_mm_stream_ps();
_mm_sfence();

is equivalent to a single inline assembly block with both of these operations. The CPU is in a strange state between those two inline assembly blocks, but can we formulate some conditions under which that strange state cannot have negative side-effects? IOW, can we weaken the rule that inline asm can only do things which Rust could have done, to something where one inline asm block gets the machine into a state that does not fully match what Rust could have done (but the difference only affects a few operations), and then a 2nd inline asm block fixes up the state?

Basically what's important is that there is no release operation between the two intrinsics. If the programmer ensures this is the case, then -- can we ensure the compiler doesn't introduce such operations? At the very least this means ensuring the inline asm blocks don't get reordered with release operations, but is that sufficient? I am a bit worried this might be too fragile, but OTOH a principle of "it's fine if machine state the Rust AM doesn't currently need is inconsistent" does seem useful. It's just hard for me to predict the impact of this on optimizations.
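As a concrete sketch of the "single inline assembly block" idea — MOVNTI and SFENCE fused into one opaque block, so the Rust AM never observes the strange in-between state. The function name is hypothetical, x86-64 only, with a plain-store stand-in elsewhere just to keep the sketch portable:

```rust
/// Hypothetical combined operation: a non-temporal store immediately
/// followed by SFENCE, inside a single asm block.
#[cfg(target_arch = "x86_64")]
pub unsafe fn nt_store_then_sfence(ptr: *mut u32, val: u32) {
    std::arch::asm!(
        "movnti [{ptr}], {val:e}", // weakly-ordered non-temporal store
        "sfence",                  // drained before the block ends
        ptr = in(reg) ptr,
        val = in(reg) val,
    );
}

#[cfg(not(target_arch = "x86_64"))]
pub unsafe fn nt_store_then_sfence(ptr: *mut u32, val: u32) {
    ptr.write(val); // plain-store stand-in on other targets
}

fn main() {
    let mut x: u32 = 0;
    unsafe { nt_store_then_sfence(&mut x, 42) };
    // After the combined block, the store is globally visible like a MOV.
    assert_eq!(x, 42);
    println!("{x}");
}
```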

@talchas

talchas commented Aug 9, 2023

Yes, I did them in order because it was simplest, and because doing two NT stores to the same location (with different values) and expecting the later one to win once you sfence() seemed plausible for graphics code or whatever and it's guaranteed by x86. Instead doing a "on insert replace any existing value" and "have each spawned thread only write a single random element" would be closer to the actual behavior, but way more of a pain to write, and as you note I don't think it actually is visible.

Mixing NT store + regular store is also guaranteed by x86 and an opaque asm block will provide enough constraint to the compiler to do the right thing, but specifying that as a rust implementation is not something I can even begin to see how to do.

@RalfJung
Member Author

RalfJung commented Aug 10, 2023

doing two NT stores to the same location (with different values) and expecting the later one to win

Ah, that's a good point, so "randomly pick things from the buffer" is not equivalent. We should check that paper to figure out if their model guarantees same-thread ordering of MOVNT.

It does seem strange though that a regular write would be UB but an NT write would be allowed. I think ideally we make it so that every NT write can legally be replaced by a regular write. So I think I'd prefer a model that flushes the buffer out-of-order.

#[thread_local]
static mut BUFFER: Option<SendMe<&'static Mutex<Vec<(*mut u8, u8)>>>> = None;

pub unsafe fn nontemporal_store(ptr: *mut u8, val: u8) {
    let buf = if let Some(b) = BUFFER {
        b
    } else {
        let b = SendMe(&*Box::leak(Box::new(Mutex::new(Vec::new()))));
        BUFFER = Some(b);
        // spawn one flushing thread per "real" thread.
        std::thread::spawn(move || loop {
            std::thread::sleep_ms(rand::random());
            let mut buf = b.0.lock().unwrap();
            if !buf.is_empty() {
                let (ptr, val) = buf.remove(rand::random::<usize>() % buf.len());
                *ptr = val;
            }
        });
        b
    };
    buf.0.lock().unwrap().push((ptr, val));
}

pub fn sfence() {
    unsafe {
        if let Some(b) = BUFFER {
            // Just wait until the background thread drained the buffer.
            while b.0.lock().unwrap().len() > 0 {}
        }
    }
}

@talchas

talchas commented Aug 10, 2023

Every difference compared to the actual intrinsic is downside, so no, I don't think that's a good idea (if std wanted to expose a nontemporal_store outside of stdarch, then maybe, but not for _mm_sfence). Weird corner cases that only exist because of trying to shoehorn this into a spec aren't really a problem. It doesn't even make the fake code representation of the intrinsic look better.

(Also a bit of the point is that between the two possible asm blocks the compiler can't actually screw up "asm!("movnti" ...), the spec is movnti", so if someone did want to write that they could, for all that you'd refuse because AM)

@RalfJung
Member Author

RalfJung commented Aug 11, 2023

The actual MOVNT does have the property that it is safe to replace it by a MOV. Though I agree that adding extra UB to satisfy a property like this is not a clear win.

But actually my proposal wouldn't even make it so that doing two MOVNT to the same location would be UB -- it would just be non-deterministic which write would win. So I'm not happy with that proposal either. I think this one has the desired effect:

#[thread_local]
static mut PENDING_WRITES: AtomicUsize = AtomicUsize::new(0);

pub unsafe fn nontemporal_store<T>(ptr: *mut T, val: T) {
    PENDING_WRITES.fetch_add(1, Relaxed);
    // Spawn a thread that will eventually do our write.
    let ptr = SendMe(ptr);
    let pending_writes = SendMe(addr_of!(PENDING_WRITES));
    std::thread::spawn(move || {
        let ptr = ptr; let pending_writes = pending_writes; // closure field capturing is annoying...
        std::thread::sleep_ms(rand::random()); // not really needed due to scheduler non-determinism
        *ptr.0 = val;
        (&*pending_writes.0).fetch_sub(1, Release);
    });
}

pub fn sfence() {
    unsafe {
        // Wait until there are no more pending writes.
        while PENDING_WRITES.load(Acquire) > 0 {}
    }
}

Here's a variant that actually builds and also uses Box::leak to avoid a use-after-free in the write-back thread.

This also has the nice advantage of making it much easier to support writing arbitrary types.

(Also a bit of the point is that between the two possible asm blocks the compiler can't actually screw up "asm!("movnti" ...), the spec is movnti", so if someone did want to write that they could, for all that you'd refuse because AM)

🤷 they could write whatever they want, of course, but it would not be a solid argument showing that the compiler cannot screw up. "We don't know a counterexample" is just not strong enough evidence IMO. Considering "We don't know a counterexample" good enough is what led to LLVM's semantics for uninit and pointer provenance being an inconsistent mess. Rust should strive to do better than that.

@apiraino
Contributor

WG-prioritization assigning priority (Zulip discussion).

@rustbot label -I-prioritize +P-medium

@rustbot rustbot added P-medium Medium priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Aug 15, 2023
@joshtriplett
Member

joshtriplett commented Aug 18, 2023 via email

@RalfJung
Member Author

RalfJung commented Aug 19, 2023

That's not a fatal flaw; sometimes being non-deterministic for performance is OK, as long as there's a well-defined way to be deterministic if you want to be, such as by adding additional barriers.

Yeah but it was my intent for that to be UB.^^ (And barriers make it defined, of course.) That is achieved by my later proposal, which can be summarized very succinctly: a nontemporal store is like starting a new background thread that will do the actual store using non-atomic ordering at some point in the future. The fence is waiting for all background threads that were spawned by nontemporal stores of the current thread. The hope is that all the behavior of nontemporal stores can be described with this model.

If we want to define that behavior even without fences we should go with something closer to @talchas' original proposal. This can be summarized as: there is a per-thread write-back buffer for nontemporal stores. A background thread flushes the entire buffer at non-deterministic intervals using non-atomic writes. (There is no partial flushing, it's always flushing the entire thing.) The fence flushes the entire buffer right then and there.

In both cases, MOVNT; MOV to the same location is UB, since there is a race between the write-back thread and the MOV in the real thread. MOVNT; SFENCE; MOV is fine since the fence guarantees that the write-back happened so there cannot be a race.
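The "MOVNT; SFENCE; release store" pattern that both models accept looks like this in terms of the stable intrinsics (a sketch; the cfg fallback to a plain store on non-x86-64 targets is only there to keep the example portable):

```rust
use std::ptr::addr_of_mut;
use std::sync::atomic::{AtomicBool, Ordering};

static mut DATA: i32 = 0;
static INIT: AtomicBool = AtomicBool::new(false);

unsafe fn nt_store_then_fence(ptr: *mut i32, val: i32) {
    #[cfg(target_arch = "x86_64")]
    {
        use std::arch::x86_64::{_mm_sfence, _mm_stream_si32};
        _mm_stream_si32(ptr, val); // MOVNTI: weakly ordered NT store
        _mm_sfence(); // without this line we are in the UB pattern from the OP
    }
    #[cfg(not(target_arch = "x86_64"))]
    ptr.write(val); // plain-store stand-in on other targets
}

fn main() {
    let reader = std::thread::spawn(|| {
        while !INIT.load(Ordering::Acquire) {}
        // The SFENCE before the release store orders the NT write before
        // INIT, so by both models this read cannot race.
        assert_eq!(unsafe { DATA }, 42);
    });

    unsafe { nt_store_then_fence(addr_of_mut!(DATA), 42) };
    INIT.store(true, Ordering::Release);
    reader.join().unwrap();
    println!("ok");
}
```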

@eduardosm
Contributor

eduardosm commented Sep 8, 2023

The _mm_maskmoveu_si128 intrinsic uses the maskmovdqu instruction.

According to Intel documentation, it looks like a non-temporal store, even though Rust documentation does not mention it.

Conditionally store 8-bit integer elements from a into memory using mask (elements are not stored when the highest bit is not set in the corresponding element) and a non-temporal memory hint. mem_addr does not need to be aligned on any particular boundary.

@talchas

talchas commented Sep 8, 2023

maskmovdqu, not movmskpd, they're very different instructions (thanks intel), but otherwise yes, it (and maskmovq and all of movnt*) is a nontemporal store.

@tmandry
Member

tmandry commented Oct 10, 2023

Proposal: Can we lint on these intrinsics for now (by marking them as deprecated, or implementing a dedicated lint), warning people that they are unsound, and figure out how to address them longer term in the future?

This was discussed in the lang team meeting. We recognize that there needs to be more in-depth discussion on this, but a lint would be a useful band-aid for the future.

If, on the other hand, we agree that it is possible to use these intrinsics safely, we should update the documentation on how to do that, and consider adding a lint that looks for unsound uses of them.

@traviscross
Contributor

If we do add a lint, we should make sure to tell people that if they do want to live on the wild side and use these (as clearly some people do), that they need to add the SFENCE, as that often seems to be neglected.

@RalfJung
Member Author

RalfJung commented Oct 11, 2023

I'm still in favor of the approach described above:

  • reimplement these functions in inline asm (since LLVM isn't properly modeling nontemporal stores so there's no way we can rely on it doing anything sensible here)
  • extend the documentation to explain that they behave like this Rust code, or like this one.

@traviscross
Contributor

@rustbot labels -I-lang-nominated

I'm going to unnominate because, while this issue is important, T-lang is not likely to have much more to say about it without a document that explores the issue and the feasible resolutions in some detail. Or additionally or alternatively, T-lang would likely have something more to say about a PR adding e.g. a lint (as suggested above) or about a specific addition to the documentation.

Please nominate any of those things, or re-nominate this issue if there's new information and a specific question that T-lang might be able to answer.

@rustbot rustbot removed the I-lang-nominated The issue / PR has been nominated for discussion during a lang team meeting. label Oct 24, 2023
@RalfJung
Member Author

T-lang is not likely to have much more to say about it without a document that explores the issue and the feasible resolutions in some detail

This comment (linked from the top comment) was supposed to be such a "document" / summary.

@workingjubilee
Contributor

workingjubilee commented Oct 25, 2023

Sorry but, as far as I can tell, "a document that explores the issue in detail" means nothing if the information provided so far is insufficient. I am assuming that T-lang does not literally mean they need us to copy and paste that comment's body text into a HackMD link.

May we have the pleasure of T-lang specifying what they feel is actually needed in terms of "exploring the issue in detail"?

Admittedly, the text offered is incomplete.

@tmandry I haven't said it yet in this thread, but the problem actually does not start or end with these intrinsics, so asking about linting on them alone is a distraction. The property attributed to "non-temporal stores" is a property that can apply to anywhere in memory if the page is mapped appropriately, and sometimes such interactions can happen for graphics API code which exposes the ability to intentionally map memory pages appropriately to userspace, and which is used to obtain high-performance graphics (in the sense that for the past 10-20 years we have had "high-performance graphics" on computers). Because it makes a lot of things a great deal more efficient for certain write-heavy sequences. Unlike "you can open /proc/self/mem and sabotage yourself on purpose", this is done intentionally for not-a-joke reasons by ordinary programs and is expected to have a meaningful operational semantics.

@RalfJung
Member Author

RalfJung commented Oct 25, 2023

Admittedly, the text offered is incomplete.

Yeah, that's fair.

I think my preferred way to proceed is a PR against stdarch that

  • changes the implementations to inline asm, so that we avoid the LLVM nontemporal flag (which clearly LLVM itself has no clue how to treat properly so we should take a wide berth around it)
  • document that these stores behave like "spawn a background thread that does the store eventually (as a non-atomic store)" and "sfence synchronizes with all background threads that were previously spawned by the current thread". Accesses that race with that background thread store lead to data races and hence UB.

Once we have that PR we can ask t-lang for FCP on that I guess. I'd be happy to write the docs but I'd probably need help with the inline asm part.

@Nilstrieb Nilstrieb added O-x86_32 Target: x86 processors, 32 bit (like i686-*) and removed O-x86-all labels Oct 25, 2023
@tmandry
Member

tmandry commented Oct 27, 2023

That sounds like a reasonable way forward to me, @RalfJung. For the larger questions, it sounds like we should probably have a design meeting (or an RFC) dedicated to the topic.

@RalfJung
Member Author

I've opened rust-lang/stdarch#1534 to update the stdarch docs. However I need help updating the implementations of these intrinsics to move away from the !nontemporal LLVM attribute -- either using LLVM functions specific for a particular instruction, or if that is not possible, using inline assembly.

Where can I find out which instruction-specific LLVM functions exist? (I mean functions like llvm.x86.sse4a.movnt.sd.)

@RalfJung
Member Author

And the last step, moving away from LLVM's !nontemporal which (from what can be gleaned from the LangRef) entirely fails to account for the fact that this has subtle interactions with the memory model: rust-lang/stdarch#1541.

@iximeow
Contributor

iximeow commented Mar 9, 2024

i missed most of the existence of this issue, but thank you for also updating the documentation on {core,std}::intrinsics::nontemporal_store to point away from the intrinsic. i'd missed that at first and was going to suggest as much here, before finding that PR :)

i'd also like to, after the fact, thank folks here for concluding on the inline assembly implementation for _mm_stream_*, as someone who would use these. to go back to a very early comment here,

Also I think if we lower _mm_stream_ps in a way that includes an sfence, people will be very surprised. We should rather not expose _mm_stream_ps than expose something that's not the same operation.

to which i very strongly agree. i have a very specific expectation for _mm_stream_* and what nontemporal_store would mean on x86/x86_64 targets: "results in use of a movnt* instruction appropriate for the data being operated on". maintaining that is something of a relief :)

i realize the conversation here is focused mostly on the unsoundness of the existing implementation of these intrinsics, but i'd like to emphasize as a user of these instructions that their primary value is for the microarchitectural hardware effect. it's not particularly useful to any use i've seen that these stores are write-combining and weakly ordered, or that corresponding non-temporal loads are also WC and weakly-ordered. in fact (opinion warning) it's something of an annoying x86/x86_64 implementation detail rather than an expected or desired behavior. the primary value of these intrinsics is to avoid pulling known-unlikely data into caches only because it happened to be needed once.

using write-combining semantics to achieve that effect is, in my understanding, mostly a convenience to reuse an existing x86 memory type, rather than invent a new type used only for a handful of instructions. so, i hope this underlines why simply implementing these intrinsics as regular loads and stores would be so surprising.

one other earlier comment i'd like to understand better, on a hypothetical alternative to having simply stabilized these intrinsics a while ago:

something ad-hoc and unprincipled like "after a nontemporal store, until the next sfence, any release operation (release or stronger write, release or stronger fence, thread spawn) is UB".

would it be correct to phrase this a little more precisely as "after a nontemporal store and until the next sfence, any concurrent access to the memory nontemporally stored to is UB"? this is an important distinction as where i've typically used non-temporal instructions is to write to a buffer that i may not ever read again from the Rust function doing writing in the first place. in those cases it's on me to ensure (sfence or other target-appropriate) ordering guarantees, but it would not result in UB in terms of the Rust AM?

of course, "simply integrate the Rust AM, cross-language requirements, and subtle target architecture semantics and make sure you got it all right" is a tall order and part of why i cannot, for example, Just Point To where i do or would use _mm_stream_*.

and one other thought on why these are useful intrinsics to have, re.

For what it's worth, the basic intended usage of movnti and movntdq, followed by sfence, is in fact basically as what was, at the time, equivalent to what is now basically encompassed by Enhanced REP MOVSB, which on CPUs with that feature means ...

Enhanced REP MOVSB, and the related fast-string feature, as far as i've read, do not guarantee anything about cache lines related to the buffers that are operated on. in practice cache lines may not be filled, but this would be architecture-dependent in a way that movnt*/sfence are not. additionally, movnt*/sfence are also used to implement non-cache-polluting memset, which rep stosb + fast-string doesn't guarantee. if the Rust project could pressure Intel into committing to non-cache-polluting semantics for fast string operations from some architecture onwards, i and all 200 people who are particularly invested in this weird corner of x86 would be greatly appreciative 🙏

finally, a kind of meta thought on the entire notion of these instrinsics as someone who would use them: in 2022 i'd considered using nontemporal_store but also needed a nontemporal_load. asm!() was not yet stable, and so i used neither and wrote a larger routine as assembly in a separate .s and linked it in. revisiting that from the perspective of this issue in 2024 i:

  • would be warned away from nontemporal_store,
  • not realize that _mm_stream_*() exist, just like i did not know they were present in 2022,
  • know that i just want movnti or vmovntdqa, write an asm!() block anyway.

if i'd known to look at _mm_stream_*() i'd probably use them in 2024. to opine, the entire notion of vendor-defined intrinsics is a little bit of a mess; i know the instruction i want on the far end of all of this, and at all points i'm basically working backwards to discover the most stable way to get that instruction produced in the resulting program, so the value of a vendor-defined intrinsic to me is that in C i could use that rather than compiler-specific assembly directives. not the kind of issue i'd have here.. :)

that all said, rust-lang/stdarch#1534 and rust-lang/stdarch#1541 seem quite reasonable, so thanks again.

@RalfJung
Member Author

i'd also like to, after the fact, thank folks here for concluding on the inline assembly implementation for _mm_stream_*, as someone who would use these.

I am not sure we have concluded yet. rust-lang/stdarch#1541 is still open and I honestly don't know if people generally think that this is a good idea or not.

I'd prefer if we could use an LLVM intrinsic, as we do for most of the other vendor intrinsics, but I don't know if LLVM would even accept adding such an intrinsic given that they have this !nontemporal attribute already.

something ad-hoc and unprincipled like "after a nontemporal store, until the next sfence, any release operation (release or stronger write, release or stronger fence, thread spawn) is UB".

would it be correct to phrase this a little more precisely as "after a nontemporal store and until the next sfence, any concurrent access to the memory nontemporally stored to is UB"? this is an important distinction as where i've typically used non-temporal instructions is to write to a buffer that i may not ever read again from the Rust function doing writing in the first place. in those cases it's on me to ensure (sfence or other target-appropriate) ordering guarantees, but it would not result in UB in terms of the Rust AM?

What you were quoting was an earlier, different proposal. What you said is very different from the text you were quoting. So no it wouldn't be correct to view what you said as a rephrasing of what you quoted. When I wrote "release operations are UB" I meant exactly that, I don't see how what you are saying is even similar to what I said.

But anyway please let's not discuss an outdated proposal that has since been overhauled. The updated proposal is what landed in rust-lang/stdarch#1534.

Labels
I-unsound Issue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness O-x86_32 Target: x86 processors, 32 bit (like i686-*) O-x86_64 Target: x86-64 processors (like x86_64-*) P-medium Medium priority T-lang Relevant to the language team, which will review and decide on the PR/issue. T-opsem Relevant to the opsem team