RFC: hint::black_box #2360
Conversation
sfackler reviewed Mar 12, 2018 · text/0000-bench-utils.md Outdated
gnzlbg force-pushed the gnzlbg:black_box branch 4 times, most recently from b3f4441 to 8a9ae3f on Mar 12, 2018
rkruppe reviewed Mar 12, 2018 · text/0000-bench-utils.md Outdated
gnzlbg force-pushed the gnzlbg:black_box branch from 8a9ae3f to ec42fd0 on Mar 12, 2018
rkruppe reviewed Mar 12, 2018
```rust
pub fn clobber() -> ();
```

flushes all pending writes to memory. Memory managed by block scope objects must
rkruppe (Member), Mar 12, 2018:
This wording (flushing pending writes) makes me uncomfortable because it's reminiscent of memory consistency models, including hardware ones, when this is just a single-threaded, compiler-level restriction. Actually, come to think of it, I'd like to know the difference between this and compiler_fence(SeqCst). I can't think of any off-hand.
gnzlbg (Author, Contributor), Mar 12, 2018:
> when this is just a single-threaded and compiler-level restriction

This is correct.

> compiler_fence(SeqCst). I can't think of any off-hand.

I can't either, but for some reason they do generate different code: https://godbolt.org/g/G2UoZC

I'll give this some more thought.
nagisa (Contributor), Mar 12, 2018:
The difference between asm! with a memory clobber and compiler_fence is that the memory clobber requires the compiler to actually reload memory if it wants to use it again (the memory is… clobbered, i.e. considered changed), whereas compiler_fence only enforces that memory accesses are not reordered, and the compiler may still use the usual rules to conclude that it needn't reload anything.
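A minimal sketch of this contrast, using today's asm! syntax (this 2018 thread predates it) and a global so the memory is clearly reachable by the asm block; both functions are compiler-level barriers only and emit no fence instruction:

```rust
use std::arch::asm;
use std::sync::atomic::{compiler_fence, Ordering};

static mut COUNTER: u64 = 0;

pub fn fence_version() -> u64 {
    unsafe { COUNTER = 1 };
    // Only constrains the ordering of memory accesses across this point;
    // the compiler may still assume COUNTER is 1 and fold the load away.
    compiler_fence(Ordering::SeqCst);
    unsafe { COUNTER }
}

pub fn clobber_version() -> u64 {
    unsafe { COUNTER = 1 };
    // An empty asm! with default options is assumed to read and write all
    // reachable memory (like the old "memory" clobber), so the store must
    // be materialized before it and COUNTER reloaded after it.
    unsafe { asm!("") };
    unsafe { COUNTER }
}
```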
gnzlbg (Author, Contributor), Mar 13, 2018:
@nagisa The only thing that clobber should do is flush pending writes to memory. It doesn't need to require that the compiler reload memory on reuse. Maybe a volatile asm! with a memory clobber is not the best way to implement that.
rkruppe (Member), Mar 13, 2018:
@nagisa Thank you. I was misled by the fact that fences prohibit some load-store optimizations into thinking they'd also impact things like store-load forwarding on the same address with no intervening writes.

@gnzlbg asm! with a memory clobber is simply assumed to read from and write to all memory, and all its effects follow from that. To be precise, however, this does not mean any reloads are introduced after a clobbering inline asm; it just means that all the loads that are already there (of which there are a lot, given that every local and many temporaries are stack slots) can't be replaced with values loaded from the same address before the clobber.

If you want something less strong, you need to be precise. "Flushing pending writes" is not really something that makes intuitive sense at the compiler level (and it surely has no effect at runtime, for example on caches or store buffers?).
gnzlbg (Author, Contributor), Mar 13, 2018:
@rkruppe mem::clobber() should be assumed to read from all memory and to have side effects, so that any pending memory stores must have completed before the clobber. Loads that are already there in registers, temporary stack slots, etc., should not be invalidated by the clobber.
rkruppe (Member), Mar 13, 2018:
Okay that is a coherent concept. I'm not sure off-hand how to best implement that in LLVM. (compiler_fence is probably not enough, since it permits dead store elimination if the memory location is known to not escape.)
rkruppe (Member), Mar 13, 2018:
However, come to think of it, what's the difference between black_box(x) (which is currently stated to just force a write of x to memory) and let tmp = x; clobber(); (which writes x to the stack slot of tmp and then forces that store to be considered live)?
gnzlbg (Author, Contributor), Mar 13, 2018:
That's a good question, and this relates to what I meant by "block scope". In

```rust
{
    let tmp = x;
    clobber();
}
```

this {} block is the only part of the program that knows the address of tmp, so no other code can actually read it without invoking undefined behavior. So in this case, clobber does not make the store of tmp live, because nothing outside of this scope is able to read from it, and this scope doesn't read from it but executes clobber instead (clobbering "memory" does not clobber temporaries AFAICT).
However, if one shares the address of tmp with the rest of the program:

```rust
{
    let tmp = x;
    black_box(&tmp);
    clobber();
}
```

then clobber will force the store of tmp to be considered live.
gnzlbg referenced this pull request Mar 12, 2018: Move test::black_box to std and stabilize it #1484 (closed)
leodasvacas reviewed Mar 12, 2018 · text/0000-bench-utils.md Outdated
Centril added the T-libs label Mar 12, 2018
I was already concerned about the interaction with the memory model, and this comment basically confirmed my worst suspicions: there are extremely subtle interactions between the memory model and what these functions can do. I'm starting to believe that we'll be unable to give any guarantees that are of any use to benchmark authors. But if this is the case, these functions become a matter of chasing the optimizer, and benchmarking reliably requires checking the asm to see your work wasn't optimized out. That's clearly very unappealing, but it has always been true of micro-benchmarks, so maybe that's just unavoidable. This also raises the question of why this needs to be in std, if it's not compiler magic (like
I'm temporarily closing this till we resolve the semantics of these functions in the memory model repo.
gnzlbg closed this Mar 13, 2018
gnzlbg referenced this pull request Mar 13, 2018: semantics of black_blox and clobber in the memory model #45 (open)
The issue in the memory model repo is: nikomatsakis/rust-memory-model#45
Update. The summary of the discussion on the memory model repo is that
Any updates on opening the new RFC?
I've updated the RFC with the discussion of the memory model repo. It now just proposes

Other changes:

Open questions:
gnzlbg reopened this Aug 29, 2018
Unsure if we should be opening a new RFC or reopening this one, cc @rust-lang/libs
rkruppe reviewed Aug 29, 2018 · text/0000-bench-utils.md Outdated
RalfJung reviewed Aug 30, 2018 · text/0000-bench-utils.md Outdated
RalfJung reviewed Aug 30, 2018 · text/0000-bench-utils.md Outdated
Seems fine to me! I was worried you'd try to talk about which memory
Maybe we should shrink this in scope even further and have two primitives?

This seems specific enough to benchmarks/performance profiling to be less of a liability; miri and the documentation can be clear about this being much unlike

It also feels significantly better than

You could even imagine having a

Bonus points for integrating the input/output with a bench framework so users never even have to use these hints directly.
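A sketch of the two primitives, using the names and signatures that appear later in this thread; the bodies here are only placeholders, not an implementation:

```rust
/// Treated as if an unknown computation produced the value: everything the
/// compiler knew about `x` is "forgotten", so later code cannot be
/// specialized on it. (Placeholder body; a real one needs inline asm.)
pub fn bench_input<T>(x: T) -> T {
    x
}

/// Treated as if an unknown consumer observed `*x`: the computation that
/// produced the value cannot be removed as dead code. (Placeholder body.)
pub fn bench_output<T>(x: &T) {
    let _ = x;
}
```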
comex, Feb 14, 2019:

@RalfJung I agree it's not a problem we have to solve for

The

Anyway, I don't mind @eddyb's specification; it seems potentially more user-friendly than the current version, which can be confusing. I still think it would also be fine to stabilize
gnzlbg added some commits Feb 19, 2019
So I re-read the RFC and the whole discussion, and made the following modifications:

@cramertj I am unsure whether this addresses your FCP concern. Some discussion has happened since it was made, and to the best of my understanding, your concern has two parts:

With regard to naming, I've added an unresolved question to settle on a good name before stabilization. We've argued that

The operational semantics of

With regard to expanding the guarantees of

It is unclear to me whether a single primitive could solve these problems, whether that primitive would also be useful for solving the problem that this RFC attempts to solve, and, even if this were all possible, whether such a primitive should be called

Maybe, once we have well-defined primitives for all these problems, we might be able to either deprecate
@bheisler Thoughts on the API proposed in #2360 (comment)?
@eddyb I have some questions and comments about the

What do you mean by "the argument computation can be optimized out"? Does this function require the result of the computation to be put into a place? Or that putting the result of the computation into a place can be optimized out?

Note that

About

So for this to be useful, the optimizer has to assume that this operation has side-effects. That is, we can't make it

Does the optimizer need to assume that the side-effects might depend on their argument? I'd think so, since otherwise the argument's result would be unused, and the computation of the argument removed.

If

I think I might be reading this the other way around. Do you mean that

Also, this comes with the cost of having to teach two APIs for writing benchmarks, so I am not sure this would be worth it (but if
I don't understand the point about "places".

And AFAIK the current

I have my doubts we can have an LLVM-based implementation today of my semantics for
No, benchmark frameworks should integrate this for you. We can also use the

There is no reason to assume
So @eddyb and I talked about this, and this comment summarizes that. @eddyb proposes a

That is, in the statement

```rust
let mut v = Vec::with_capacity(1);
v.push(1);
let v = bench_input(v);
v[0];
```

the bound check in

However, because the

To prevent that from happening, @eddyb proposes to add

With that, one can change the snippet above to:

```rust
let mut v = Vec::with_capacity(1);
v.push(1);
let v = bench_input(v);
bench_output(v[0]);
```

to prevent the code from being removed.

Note: something like

```rust
fn black_box<T>(x: T) -> T {
    bench_output(&x);
    bench_input(x)
}
```

This is definitely finer grained than

The pessimizations here would still be a best-effort thing, e.g., one cannot rely on

This also means that, while the intent is for
bheisler, Feb 19, 2019:

Well, I'm not really qualified to give an opinion on whether this proposal makes sense from a specification or compiler standpoint. I guess my concern would be teachability. Even now, it's really hard to explain to benchmark authors what

Saying that benchmark frameworks should integrate the black boxes and the benchmark author shouldn't have to think about it is nice, but things are not always so simple. Criterion.rs already automatically

I can't see a strong reason to prefer

I am surprised by the suggestion that

I can't think of any reason to ever call it in any other way - passing an owned value into
Oh, I think I introduced that mistake. @eddyb's comment does say that it is equivalent to

Often, one just wants to inhibit optimizations that are based on the value that a variable has, and
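A tiny sketch of that kind of value-based optimization, using the black_box name from this RFC (as it later stabilized in std::hint):

```rust
use std::hint::black_box;

fn folding_demo() -> (u64, u64) {
    let a = 42u64 + 1;            // the compiler sees the value: folds to 43
    let b = black_box(42u64) + 1; // the value is hidden: the add must happen
    (a, b)
}
```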
Oh, absolutely. :) I hope to have a look at how CompCert handles syscalls, which could give us some nice ideas about how to formalize "unknown function calls".

Note that Miri's behavior is entirely unaffected by whether a function is

To avoid the

What does Criterion's API for this look like? I've been imagining something like

```rust
fn iter<I: Clone, O>(&mut self, i: &I, f: impl Fn(I) -> O) {
    for /* a number of iterations */ {
        let i = bench_input(i.clone());
        let o = self.measure_the_time(|| f(i));
        bench_output(&o);
    }
}
```

Then it is up to the user to decide whether they capture inputs in the closure or pass them in explicitly via

So, from a user's perspective:

```rust
let input1 = prepare_input1();
let input2 = prepare_input2();
let input3 = prepare_input3();
bencher.iter(&(input1, input2), |(input1, input2)| {
    // now input1 and input2 have been passed through the black_box,
    // but input3 has not.
    run(input1, input2, input3)
});
```

(This also nicely keeps the
bheisler, Feb 20, 2019 (edited):

That's more-or-less correct. There's some additional complexity to reduce the overhead and measurement error from calling

At a higher level, the user interacts with it like this:

```rust
let input_2 = unimplemented!();
c.bench(
    "GroupName",
    ParameterizedBenchmark::new(
        "function_1",
        |b, input_1| b.iter(|| function_1(*input_1, input_2)),
        vec![20u64, 21u64],
    )
    .with_function("function_2", |b, input_1| b.iter(|| function_2(*input_1, input_2))),
);
```

I should note that there are many things I don't like about this API, though, so I've been thinking about changing it. In this case, the
So, with a lot of help from @Amanieu (thanks!!), this is how the example from the RFC looks with

```rust
fn push_cap(v: &mut Vec<i32>) {
    let v: &mut _ = bench_input(v);
    for i in 0..4 {
        bench_output(v.as_ptr());
        v.push(bench_input(i));
        bench_output(v.as_ptr());
    }
}
```

vs. before:

```rust
fn push_cap(v: &mut Vec<i32>) {
    let v: &mut _ = black_box(v);
    for i in 0..4 {
        black_box(v.as_ptr());
        v.push(black_box(i));
        black_box(v.as_ptr());
    }
}
```

Implementation-wise, we can express @eddyb's semantics using inline assembly just fine. From the POV of codegen, the difference is small for this particular example, but one does observe slightly better codegen for @eddyb's version, which might be caused by the finer-grained semantics of input and output.

Implementing
@gnzlbg that's strange, why do you
The vector is both the input and the output here.

```rust
#[inline(always)] // #[readnone]
fn bench_input<T>(x: T) -> T {
    // Reads the memory in x and modifies it
    unsafe { asm!("" : "+*m"(&x) : : :); }
    x
}
```

has no side-effects, and it reads/writes to the memory of

```rust
#[inline(always)]
fn bench_output<T>(x: &T) -> () {
    // this reads the operand, reads all memory, and has a side-effect
    unsafe { asm!("" : : "*m"(x) : "memory" : "volatile"); }
}
```

has side-effects that depend on the memory of

If the user wants to measure the time it takes to push elements, including the time it takes to write the elements to the vector's memory, they need to force those writes to memory to happen. We could say that

```rust
#[inline(always)]
fn bench_input<T>(x: T) -> T {
    // Reads the memory in x and modifies it
    unsafe { asm!("" : "+*m"(&x) : : "memory" :); }
    x
}
```

which means that in practice it wouldn't be

On paper, that would mean that

In practice, both

Given the implementation constraints, in practice, if we pursue this approach, we'd end up with two ways of spelling the

So AFAICT for
Sure, but you are making it the output on the input side. From an API perspective, that does not make any sense.

Again from an API perspective, I find that weird. I'd think that a
We can definitely specify this, I just don't know how to implement this.
Hm... I know nothing about these asm annotations, so this might make no sense, but is it possible to have two asm blocks, one that may read any memory transitively reachable from
comex, Feb 23, 2019:

From LLVM's perspective, inline assembly invocations are actually call instructions, and thus have an associated

```cpp
// Attach readnone and readonly attributes.
if (!HasSideEffect) {
  if (ReadNone)
    Result->addAttribute(llvm::AttributeList::FunctionIndex,
                         llvm::Attribute::ReadNone);
  else if (ReadOnly)
    Result->addAttribute(llvm::AttributeList::FunctionIndex,
                         llvm::Attribute::ReadOnly);
}
```

The attributes seem to be the only mechanism by which LLVM determines the possible side effects of

Currently, rustc never attaches

This should be fixed to match Clang. In that case, I think

...That said, I'm not convinced that such a subtle distinction is really worth providing a separate API for. In most cases it won't make a difference, and when it does, i.e. when optimizations are performed due to the use of
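For reference, today's asm! syntax exposes this choice directly: by my reading, options(nomem) corresponds roughly to ReadNone and options(readonly) to ReadOnly, while omitting both keeps the conservative "may read and write any memory" default. A sketch of that default (my illustration, not rustc's actual black_box implementation):

```rust
use std::arch::asm;

// With no `nomem`/`readonly` option, the empty asm block is assumed to
// potentially read and write memory and to alter `x`, so the compiler
// cannot see through it: an opaque identity function.
fn opaque_identity(mut x: u64) -> u64 {
    unsafe { asm!("", inout(reg) x) };
    x
}
```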
gnzlbg referenced this pull request Feb 23, 2019: Deprecate Read::initializer in favor of ptr::freeze #58363 (closed)
@comex Keep in mind this "subtle distinction" also makes these functions very benchmark-centric and easily disavowable as "not for working around UB", which was my initial intention. Also the different signatures should prevent most misuse IMO.
gnzlbg commented Mar 12, 2018 (edited):

Adds black_box to core::hint. Rendered.
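A usage sketch of the proposed hint; the signature matches what eventually stabilized as std::hint::black_box, and the micro-benchmark itself is only illustrative:

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    let start = Instant::now();
    let mut acc = 0u64;
    for i in 0..1_000_000u64 {
        // black_box(i) hides i's value from the optimizer, and the final
        // black_box(&acc) keeps the result "used", so the loop is neither
        // constant-folded nor removed as dead code.
        acc = acc.wrapping_add(black_box(i).rotate_left(7));
    }
    black_box(&acc);
    println!("elapsed: {:?}", start.elapsed());
}
```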