
Thoughts on using a more efficient ordering in ArcStr::drop #16

Closed
thomcc opened this issue Aug 19, 2020 · 1 comment
thomcc commented Aug 19, 2020

As-is, our ArcStr::clone and ArcStr::drop code is more or less taken verbatim from the stdlib's std::sync::Arc<T> implementation*. There are a lot of reasons for this, mainly the fact that it was, at one point, proven correct. However, ArcStr offers some meaningfully stronger guarantees which could perhaps allow us to improve our implementation of Drop.

Right now Drop (for non-literal ArcStrs) is conceptually:

```rust
if self.count.fetch_sub(1, Ordering::Release) == 1 {
    // See footnote for a comment about this.
    let _ = self.count.load(Ordering::Acquire);
    // Call a non-inline function to compute the layout, deallocate the memory, and such.
    unsafe { self.drop_slow() };
}
```

However, given that we're immutable, it might give you pause that we need that Ordering::Acquire, which only ever executes on this thread (so its purpose can't be to ensure other threads see the changes we just made in the fetch_sub). This tracks with discussion in the bug report linked in the footnote, which describes it as essentially ensuring that the actual destructor sees the final view of the type, i.e. ensuring that our thread sees changes written by other threads. Well, we don't actually need those, since ArcStr is fully immutable.

In practice, it may also serve as an optimization barrier to ensure writes from drop_slow (e.g. writes from inside the memory allocator's dealloc) aren't moved before self.count.fetch_sub. This would be pretty wild, given that there's a conditional in between, but I guess in theory nothing strictly prevents it...
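For concreteness, the clone/drop pairing under discussion can be sketched as a self-contained refcount. This is illustrative only (the names RefCount/incref/decref are made up; arcstr's real code also stores the string inline and deallocates in the `true` branch):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// A minimal sketch (not arcstr's actual code) of the stdlib-style
// refcount pattern described above.
struct RefCount(AtomicUsize);

impl RefCount {
    fn new() -> Self {
        RefCount(AtomicUsize::new(1))
    }

    // Increments can be Relaxed: taking a new reference publishes nothing.
    fn incref(&self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }

    // Returns true iff the caller was the last owner and must deallocate.
    fn decref(&self) -> bool {
        if self.0.fetch_sub(1, Ordering::Release) == 1 {
            // The Acquire in question: it synchronizes with every other
            // thread's Release decrement before the destructor runs.
            let _ = self.0.load(Ordering::Acquire);
            true
        } else {
            false
        }
    }
}
```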

Further, https://en.cppreference.com/w/cpp/atomic/memory_order says that memory_order_release provides the following guarantee:

> A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store.

It's not 100% clear to me which part of the C or C++ standard backs this up. In practice it does seem like a given, but most of Release's semantics seem to be defined in terms of its interaction with Acquire. I can almost see it from the definition of release sequences (in conjunction with happens-before, carries-a-dependency, etc.), but it's hard for me to say anything that concrete. Anyway, this matters because if we can't guarantee it and don't have an Acquire, nothing prevents another thread from doing its decref and then continuing to read from the ArcStr, even though a remote thread is now allowed to deallocate the str's memory.

Do I think we might be able to remove the Acquire fence? Well, kinda:

It feels pretty likely that cppreference's interpretation (that reads on a thread cannot be moved after a fetch_sub(1, Release) on the same thread) is correct. In practice it would be impossible to implement Mutex<T> without such a guarantee. (To actually change the code I would need more concrete evidence, and perhaps this is ultimately going to be the downfall, but... yeah.)
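To illustrate the Mutex argument: a toy spinlock is only sound if the Release store in unlock keeps the critical section's accesses from sinking below it. A hedged sketch (the Locked type and function names here are invented for illustration, not from arcstr or the stdlib):

```rust
use std::cell::UnsafeCell;
use std::hint::spin_loop;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// Toy spinlock guarding a plain u32. If the Release store in `unlock`
// allowed the guarded write to be reordered after it, another thread
// could acquire the lock and miss that write.
struct Locked {
    flag: AtomicBool,
    value: UnsafeCell<u32>,
}

// SAFETY: `value` is only accessed between lock() and unlock().
unsafe impl Sync for Locked {}

impl Locked {
    fn new() -> Self {
        Locked { flag: AtomicBool::new(false), value: UnsafeCell::new(0) }
    }
    fn lock(&self) {
        // Acquire pairs with the Release store in unlock().
        while self.flag.swap(true, Ordering::Acquire) {
            spin_loop();
        }
    }
    fn unlock(&self) {
        // Release: nothing from the critical section may be reordered
        // after this store. This is exactly the guarantee in question.
        self.flag.store(false, Ordering::Release);
    }
}

fn increment_concurrently(threads: usize, per_thread: u32) -> u32 {
    let shared = Arc::new(Locked::new());
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let s = Arc::clone(&shared);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    s.lock();
                    // A plain, non-atomic write, made safe only by the lock.
                    unsafe { *s.value.get() += 1 };
                    s.unlock();
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    unsafe { *shared.value.get() }
}
```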

Additionally, while it's true that without the Acquire fence individual threads have no ordering with each other, it is guaranteed that the fetch_subs are still atomic, so at most one thread will ever reach 0, be allowed entry into the conditional block, and be allowed to free the memory.
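That atomicity guarantee is easy to demonstrate: even with fully Relaxed decrements, a count initialized to n and decremented n times hands the old value 1 to exactly one thread. A small sketch (count_winners is a made-up name for illustration):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// RMW operations are atomic regardless of ordering: with the count
// initialized to n and n Relaxed decrements, exactly one thread
// observes the old value 1 and "wins" entry to the drop path.
fn count_winners(n: usize) -> usize {
    let count = Arc::new(AtomicUsize::new(n));
    let winners = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..n)
        .map(|_| {
            let count = Arc::clone(&count);
            let winners = Arc::clone(&winners);
            thread::spawn(move || {
                if count.fetch_sub(1, Ordering::Relaxed) == 1 {
                    winners.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    winners.load(Ordering::Relaxed)
}
```

Of course, this only shows that one thread wins the race; it says nothing about whether that thread's subsequent frees are ordered correctly, which is the actual question.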

That said, neither of these is really a strong guarantee under the memory model, so I'd need something stronger to convince me. Hell, even with obvious mistakes in the ordering, AFAICT loom happily accepts my code... So maybe I need to install Coq or whatever theorem prover has an easily usable specification of the C memory model's atomics... That said, it seems likely to be a while before I can do something like that.

Anyway, this was a bit rambling, and I probably repeated myself, but I wanted to get it down somewhere so that I can later take a look and say "ah no, I already thought about that" and such.


* Okay, not verbatim for ArcStr::drop. We make a small modification, replacing an explicit atomic::fence(Acquire) with a .load(Acquire) of the refcount. The stdlib actually does this too, but only when running under ThreadSanitizer. At one point there was a PR to make this change unconditional (rust-lang/rust#41714), which AFAICT mainly didn't merge because the policy at the time was to only accept PRs like this with benchmarks showing improvements, which is hard given that the distinction here mainly matters under high contention and is specific to how the cache is being used.

However, it's possibly worth noting that this means the variant we use, while (IMO) "obviously" equivalent to using a fence, is not actually identical to what was proven correct in RustBelt.
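To make the footnote concrete, here are the two formulations side by side as a sketch (the function names are invented; the real drop also deallocates in the `true` branch):

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};

// The variant proven correct in RustBelt: a standalone Acquire fence.
fn drop_with_fence(count: &AtomicUsize) -> bool {
    if count.fetch_sub(1, Ordering::Release) == 1 {
        fence(Ordering::Acquire);
        true
    } else {
        false
    }
}

// The variant arcstr uses (and the stdlib uses under ThreadSanitizer,
// which doesn't model standalone fences well): an Acquire load of the
// refcount, which is (IMO) "obviously" equivalent but not identical.
fn drop_with_load(count: &AtomicUsize) -> bool {
    if count.fetch_sub(1, Ordering::Release) == 1 {
        let _ = count.load(Ordering::Acquire);
        true
    } else {
        false
    }
}
```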

thomcc commented Dec 20, 2020

I think it's pretty conclusive that we can't do better here. Closing to avoid confusing anyone in the future.
