RFC: hint::black_box #2360

Open · wants to merge 20 commits into base: master from gnzlbg:black_box

Conversation

gnzlbg (Contributor) commented Mar 12, 2018

Adds black_box to core::hint.

Rendered.


@gnzlbg gnzlbg force-pushed the gnzlbg:black_box branch 4 times, most recently from b3f4441 to 8a9ae3f Mar 12, 2018


@gnzlbg gnzlbg force-pushed the gnzlbg:black_box branch from 8a9ae3f to ec42fd0 Mar 12, 2018

Review thread on text/0000-bench-utils.md, quoting the RFC text:

```rust
pub fn clobber() -> ();
```

flushes all pending writes to memory. Memory managed by block scope objects must

rkruppe (Member) commented Mar 12, 2018

This wording (flushing pending writes) makes me uncomfortable because it's reminiscent of memory consistency models, including hardware ones, when this is just a single-threaded and compiler-level restriction. Actually, come to think of it, I'd like to know the difference between this and compiler_fence(SeqCst). I can't think of any off-hand.

gnzlbg (Contributor, Author) commented Mar 12, 2018

> when this is just a single-threaded and compiler-level restriction

This is correct.

> compiler_fence(SeqCst). I can't think of any off-hand.

I can't either, but for some reason they do generate different code: https://godbolt.org/g/G2UoZC

I'll give this some more thought.

nagisa (Contributor) commented Mar 12, 2018

The difference between asm! with a memory clobber and compiler_fence is that a memory clobber requires the compiler to actually reload memory if it wants to use it again (as memory is… clobbered – considered changed), whereas compiler_fence only enforces that memory accesses are not reordered, and the compiler may still use the usual rules to figure out that it needn't reload anything.
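
(For concreteness, here is a minimal sketch of the two constructs being contrasted, written in the pre-2021 asm! syntax that is used elsewhere in this thread; it required a nightly compiler and #![feature(asm)] at the time, and the function names are purely illustrative.)

```rust
#![feature(asm)]
use std::sync::atomic::{compiler_fence, Ordering};

// Only prevents the compiler from reordering memory accesses across this
// point; it may still reuse values it already knows are in memory.
fn fence_only() {
    compiler_fence(Ordering::SeqCst);
}

// Volatile asm with a "memory" clobber: the compiler must assume all
// escaped memory may have been read and written here, so it flushes
// pending stores before the block and reloads values afterwards.
fn memory_clobber() {
    unsafe { asm!("" : : : "memory" : "volatile") };
}
```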

gnzlbg (Contributor, Author) commented Mar 13, 2018

@nagisa the only thing that clobber should do is flush pending writes to memory. It doesn't need to require that the compiler reloads memory on reuse. Maybe the volatile asm! with memory clobber is not the best way to implement that.

rkruppe (Member) commented Mar 13, 2018

@nagisa Thank you. I was misled by the fact that fences prohibit some load-store optimizations into thinking they'd also impact things like store-load forwarding on the same address with no intervening writes.

@gnzlbg asm! with memory is simply assumed to read from and write to all memory, and all its effects follow from that. However, to be precise, this does not mean any reloads are introduced after a clobbering inline asm; it just means that all the loads that are already there (of which there are a lot, given that every local and many temporaries are stack slots) can't be replaced with values loaded from the same address before the clobber.

If you want something less strong, you need to be precise. "Flushing pending writes" is not really something that makes intuitive sense at the compiler level (and it surely has no effect at run time, for example on caches or store buffers?).

gnzlbg (Contributor, Author) commented Mar 13, 2018

@rkruppe mem::clobber() should be assumed to read from all memory and to have side effects, so that any pending memory stores must have completed before the clobber. All the loads that are already there in registers, temporary stack slots, etc., should not be invalidated by the clobber.

rkruppe (Member) commented Mar 13, 2018

Okay that is a coherent concept. I'm not sure off-hand how to best implement that in LLVM. (compiler_fence is probably not enough, since it permits dead store elimination if the memory location is known to not escape.)
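
(A small illustration of that concern, under the assumption that the local never escapes; the function is made up for this example.)

```rust
use std::sync::atomic::{compiler_fence, Ordering};

// `_tmp`'s address never escapes this function, so nothing else can
// observe it. Even with the fence in place, the store initializing `_tmp`
// is dead and may be eliminated.
fn example(x: u32) {
    let _tmp = x;           // dead store: `_tmp` is never read again
    compiler_fence(Ordering::SeqCst);
    // Nothing here (or anywhere else) can read `_tmp`, so the compiler is
    // free to delete the store above despite the fence.
}
```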

rkruppe (Member) commented Mar 13, 2018

However, come to think of it, what's the difference between black_box(x) (which is currently stated to just force a write of x to memory) and let tmp = x; clobber(); (which writes x to the stack slot of tmp and then forces that store to be considered live)?

gnzlbg (Contributor, Author) commented Mar 13, 2018

That's a good question, and it relates to what I meant by "block scope". In

```rust
{
    let tmp = x;
    clobber();
}
```

this {} block is the only part of the program that knows the address of tmp so no other code can actually read it without invoking undefined behavior. So in this case, clobber does not make the store of tmp live, because nothing outside of this scope is able to read from it, and this scope doesn't read from it but executes clobber instead (clobbering "memory" does not clobber temporaries AFAICT).

However, if one shares the address of tmp with the rest of the program:

```rust
{
    let tmp = x;
    black_box(&tmp);
    clobber();
}
```

then clobber will force the store of tmp to be considered live.


@Centril Centril added the T-libs label Mar 12, 2018

rkruppe (Member) commented Mar 13, 2018

I was already concerned about the interaction with the memory model, and this comment basically confirmed my worst suspicions: there are extremely subtle interactions between the memory model and what these functions can do.

I'm starting to believe that we'll be unable to give any guarantees that are of any use to benchmark authors. But if that is the case, these functions become a matter of chasing the optimizer, and benchmarking reliably requires checking the asm to see that your work wasn't optimized out. That's clearly very unappealing, but it has always been true of micro-benchmarks, so maybe that's just unavoidable.

This also raises the question of why this needs to be in std, if it's not compiler magic (like compiler_fence is) but a natural consequence of the memory model and inline asm. Aside from the stability of inline asm, which is not a very convincing argument to me.

gnzlbg (Contributor, Author) commented Mar 13, 2018

I'm temporarily closing this till we resolve the semantics of these functions in the memory model repo.

gnzlbg (Contributor, Author) commented Mar 13, 2018

The issue in the memory model repo is: nikomatsakis/rust-memory-model#45

gnzlbg (Contributor, Author) commented Jun 11, 2018

Update. The summary of the discussion on the memory model repo is that clobber should be scrapped, and the RFC updated with the definition of black_box agreed on there.

Manishearth (Member) commented Aug 28, 2018

Any updates on opening the new RFC?

gnzlbg (Contributor, Author) commented Aug 29, 2018

I've updated the RFC with the discussion of the memory model repo.

It now just proposes pub unsafe fn core::hint::black_box<T>(x: T) -> T, which is specified as an unknown unsafe function that returns x. It is a no-op in the virtual machine, but the compiler has to assume that it can perform any valid operation on x that unsafe Rust code is allowed to perform.
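
For illustration, here is roughly how a micro-benchmark could use the function as specified above. The timing scaffolding is made up for this sketch, and the unsafe blocks reflect the unsafe fn signature proposed in this revision:

```rust
// Sketch only: assumes a toolchain where the proposed core::hint::black_box
// exists with the unsafe signature described above.
use std::time::Instant;

fn bench_sum(data: &[u64]) {
    let start = Instant::now();
    // The compiler must treat black_box as an unknown function: it cannot
    // const-fold `data` away, and it cannot delete the summation below just
    // because nothing inspects `sum` afterwards.
    let data = unsafe { core::hint::black_box(data) };
    let sum: u64 = data.iter().sum();
    unsafe { core::hint::black_box(sum) };
    println!("summed {} elements in {:?}", data.len(), start.elapsed());
}
```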

Other changes:

  • moved the function from core::mem to core::hint since that appears to be a more suitable place for this functionality which is a no-op.

Open questions:

  • Do we need a "safe" alternative? What safe and unsafe Rust are allowed to do differs; e.g., if the user passes it a raw pointer, unsafe Rust is allowed to dereference it while safe Rust is not.

  • In the memory model repo discussion it was a bit unclear (to me) whether this function should take a reference or not. I've kept the interface from the test crate, which takes a value; this allows users to pass it a &/&mut/*mut... or whatever they want, but I am not sure this is the right call.

  • Should we call this black_box or something else? (e.g. unknown)

cc @RalfJung @ubsan @rkruppe

@gnzlbg gnzlbg reopened this Aug 29, 2018

Manishearth (Member) commented Aug 29, 2018

Unsure if we should be opening a new RFC or reopening this one, cc @rust-lang/libs

RalfJung (Member) commented Aug 30, 2018

Seems fine to me! I was worried you'd try to talk about which memory black_box could touch, but you actually are using the spec I was hoping for. :)

eddyb (Member) commented Feb 14, 2019

Maybe we should shrink this in scope even further and have two primitives?

  • bench_obscure (or bench_input)
    • fn(T) -> T (identity-like)
    • may prevent some optimizations from seeing through the valid T value
      • more specifically, things like const/load-folding and range-analysis
    • miri would still check the argument, and so it couldn't be e.g. uninitialized
    • the argument computation can be optimized-out (unlike bench_used)
    • mostly implementable today with the same strategy as black_box
  • bench_used (or bench_output)
    • fn(T) -> () (drop-like)
    • may prevent some optimizations from optimizing out the computation of its argument
    • the argument is not treated as "escaping into unknown code"
      • i.e. you can't implement bench_obscure(x) as { bench_used(&mut x); x }, what that would likely prevent is placing x into a register instead of memory, but optimizations might still see the old value of x, as if it couldn't have been mutated
    • potentially implementable like black_box but readonly/readnone

This seems specific enough to benchmarks/performance profiling to be less of a liability; miri and the documentation can be clear about this being much unlike freeze, and more things can be optimized.
(and it certainly doesn't need to depend on the semantics of asm!)

It also feels significantly better than black_box in that we're not reusing a primitive for two subtly distinct sides ("input" and "output") of a computation that we want to measure performance characteristics of, with subtly different (de)optimization needs.

You could even imagine having a hint::bench(T, impl FnOnce(T) -> U) -> U abstraction which handles the asymmetry while requiring less manual control.

Bonus points for integrating the input/output with a bench framework so users never even have to use these hints directly.
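
A rough sketch of how the pieces above could fit together, purely to make the shape of the proposal concrete; none of these signatures or bodies are agreed upon, and the real primitives would be compiler hints rather than the no-op bodies shown here:

```rust
// Placeholder bodies; a real implementation would be a compiler hint.
fn bench_input<T>(x: T) -> T { x }      // identity-like: obscures the value
fn bench_output<T>(x: T) { drop(x) }    // drop-like: keeps the computation live

// The hint::bench-style wrapper mentioned above, handling the input/output
// asymmetry for the caller.
fn bench<T, U>(input: T, f: impl FnOnce(T) -> U) -> U {
    let output = f(bench_input(input));
    bench_output(&output); // mark the result as "used"
    output
}
```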

comex commented Feb 14, 2019

@RalfJung I agree it's not a problem we have to solve for black_box, so I'm going to resist the temptation to reply, to avoid cluttering this thread more than I already have. But, since stabilizing asm! is also a goal, I'd be interested in talking more about it somewhere else. :)

The freeze example is interesting. It seems like the opposition is along very different lines from what this thread has discussed: not that it's hard to specify, or that future implementations might have trouble defining or upholding the semantics, but basically "there is no good reason to ever use this". I suppose you might make the same argument about the implicit freezing that black_box can do, but overall it seems like a fairly different debate.

Anyway, I don't mind @eddyb's specification; it seems potentially more user-friendly than the current version, which can be confusing. I still think it would also be fine to stabilize black_box as is, though.

gnzlbg (Contributor, Author) commented Feb 19, 2019

So I re-read the RFC, and the whole discussion, and made the following modifications:

  • Put the primary use case (benchmarking) up front.

  • Add further clarification of what black_box being a const fn could imply (e.g. no side-effects, which clearly isn't the intent here, i.e., we don't want the compiler to be able to assume that black_box has no side-effects).

  • I added an unresolved question about naming, with some pros/cons, alternatives, etc.


@cramertj I am unsure whether this addresses your FCP concern. Some discussion has happened since it was made, and to the best of my understanding, your concern has two parts:

  • black_box does not have an appropriate name,
  • black_box should solve more problems.

With regard to naming, I've added an unresolved question to settle on a good name before stabilization. We've argued that black_box was a good name because of its meaning in a CS context (see #2360 (comment)), but it seems that this meaning might not be widely known, as @cramertj mentioned:

> I'm strongly opposed to adding a function to the language called black_box that doesn't guarantee the expected semantics of a black_box function in other languages and systems.

The operational semantics of black_box in other programming languages and systems should be explored before stabilization, and that might hopefully guide us to a better name.

With regard to expanding the guarantees of black_box so that it can be used to solve more problems, like freeze and reading from memory modified by a different process: from my point of view, solving these problems is about adding ways to turn undefined behavior into defined behavior, e.g., by allowing use of uninitialized memory (freeze) or by allowing reading volatile memory without using a volatile read.

It is unclear to me whether a single primitive could solve these problems, whether that primitive would also be useful for solving the problem that this RFC attempts to solve, and even if this was all possible, whether such a primitive should be called black_box. It appears to me that finer grained primitives like freeze, which solve these problems in a finer grained way are the way forward to tackle these problems.

Maybe, once we have well-defined primitives for all these problems, we might be able to either deprecate black_box, or implement it on top of them, maybe even changing its behavior to be reliable. Right now, we are not there yet, and freeze won't be enough for that.

Centril (Contributor) commented Feb 19, 2019

@bheisler Thoughts on the API proposed in #2360 (comment)?

gnzlbg (Contributor, Author) commented Feb 19, 2019

@eddyb I have some questions and comments about the bench_obscure and bench_used proposal.


bench_obscure (or bench_input)

  • fn(T) -> T (identity-like)

  • may prevent some optimizations from seeing through the valid T value

    • more specifically, things like const/load-folding and range-analysis
  • miri would still check the argument, and so it couldn't be e.g. uninitialized

  • the argument computation can be optimized-out (unlike bench_used)

  • mostly implementable today with the same strategy as black_box

What do you mean by "the argument computation can be optimized out"? Does this function require the result of the computation to be put into a place? Or that putting the result of the computation into a place can be optimized out?

black_box allows the result of the computation to be optimized out, but requires it to be put into a place (that's what the wording about materializing into memory means), so if that's what you mean then this appears to be identical to black_box.

Note that black_box, bench_used, and bench_obscure are all safe Rust functions, so miri needs to check proper argument initialization for all of them.


About bench_used:

bench_used (or bench_output)

  • fn(T) -> () (drop-like)

  • may prevent some optimizations from optimizing out the computation of its argument

  • the argument is not treated as "escaping into unknown code"

    • i.e. you can't implement bench_obscure(x) as { bench_used(&mut x); x }, what that would likely prevent is placing x into a register instead of memory, but optimizations might still see the old value of x, as if it couldn't have been mutated
  • potentially implementable like black_box but readonly/readnone

readonly or readnone functions returning nothing have no effect and can be removed, which means that the result of the computation producing the T would be unused, and would potentially be removed as well.

So for this to be useful, the optimizer has to assume that this operation has side-effects. That is, we can't make it readonly, readnone, or const fn.

Does the optimizer need to assume that the side-effects might depend on their argument?

I'd think so, since otherwise, the argument result would be unused, and the computation of the argument removed.

  • the argument is not treated as "escaping into unknown code"

If bench_used has side-effects that depend on its argument, then I am unsure what that might mean. EDIT: as in, those side-effects are escaping something about the argument, and I'm unsure where the line would be for what the optimizer is allowed to assume about what that is.

> what that would likely prevent is placing x into a register instead of memory,

I think I might be reading this the other way around. Do you mean that bench_used would require the argument to be computed and put somewhere, but that this wouldn't necessarily need to be memory and could be a register instead?

I would find that useful, since that does add some value over black_box. EDIT: Note that black_box does not require the result to be materialized "somewhere" specific: not necessarily the stack, it could also be a register. However, if the timing difference between putting something in a register or on the stack is relevant for a benchmark, chances are that no API we provide is going to work for benchmarking that application, since e.g. CPU μOp optimizations like move elimination are going to influence your result, and those will trigger here if you write to a register but do not use the result (this could happen with black_box as well, though, and is more of a general limitation).

Also, this comes with the cost of having to teach two APIs for writing benchmarks, so I am not sure this would be worth it (but if bench_obscure is equivalent to black_box, this second API can always be added later if deemed necessary).

eddyb (Member) commented Feb 19, 2019

I don't understand the point about "places".

black_box(x) today is { anything_could_happen(&mut x); x } where anything_could_happen has the "escape input to unknown side-effects" semantics (which is like bench_used but stronger) and is actually a bit of inline asm!.

And AFAIK the current black_box (as in, the one friendly to benchmarks) cannot be optimized out, as it has to handle both benchmark input and output.

I have my doubts we can have an LLVM-based implementation today of my semantics for bench_{obscure,used} that is actually weaker than black_box, but we can still keep the spec weaker in the meanwhile.

eddyb (Member) commented Feb 19, 2019

> Also, this comes with the cost of having to teach two APIs for writing benchmarks

No, benchmark frameworks should integrate this for you.

We can also use the bench_{input,output} names which should make it obvious what to do.

There is no reason to assume black_box is intuitive today.

gnzlbg (Contributor, Author) commented Feb 19, 2019

So @eddyb and I talked about this, and this comment summarizes that discussion.

@eddyb proposes a bench_input function with the operational semantics of the identity function, which the optimizer is encouraged/hinted to treat as an unknown pure function (think extern { const fn bench_input<T>(x: T) -> T; }).

That is, the statement bench_input(expr); can be completely optimized away, and given:

```rust
let mut v = Vec::with_capacity(1);
v.push(1);
let v = bench_input(v);
v[0];
```

the bounds check in v[0] cannot be removed, because bench_input could return any vector (e.g. Vec::new()). Also, because the return value depends on the input, the write of 1 to memory in the push must be flushed before calling bench_input.

However, because the result of v[0]; is not used, it could be optimized away, and because v is not used afterwards and bench_input is pure, the whole snippet can actually be optimized away.

To prevent that from happening, @eddyb proposes to add bench_output(x), which has the same operational semantics as mem::forget: it leaks its value, runs no destructors, but which the optimizer is encouraged to treat as an unknown function with read-only side-effects that depend on x.

With that, one can change the snippet above to:

```rust
let mut v = Vec::with_capacity(1);
v.push(1);
let v = bench_input(v);
bench_output(v[0]);
```

to prevent the code from being removed.

Note: bench_output cannot be implemented as fn bench_output<T>(x: T) { bench_input(x); }, because that would mean that it can be removed. It must therefore be a special API.

Something like black_box, as proposed in this RFC, would then be implemented as:

```rust
fn black_box<T>(x: T) -> T {
    bench_output(&x);
    bench_input(x)
}
```

This is definitely finer-grained than black_box as proposed in the RFC. I like how thinking about bench_input as a "foreign const fn" makes it easy for me to have an idea about which optimizations are and aren't inhibited by it. I find reasoning about bench_output a bit harder, but not as hard as reasoning about black_box, which conflates both.

The pessimizations here would still be a best-effort thing; e.g., one cannot rely on bench_input to inhibit constant propagation portably.

This also means that, while the intent is for bench_output and bench_input to inhibit different types of optimizations, in practice they might inhibit more optimizations than the ones they aim to inhibit, depending on the implementation (e.g. one could implement both as black_box).

bheisler commented Feb 19, 2019

> @bheisler Thoughts on the API proposed in #2360 (comment)?

Well, I'm not really qualified to give an opinion on whether this proposal makes sense from a specification or compiler standpoint.

I guess my concern would be teachability. Even now, it's really hard to explain to benchmark authors what black_box does and if/when they should add black_box calls on top of what Criterion.rs already does. In some sense, this is exacerbated by the vagueness of what it does. The best I can really do is advise users to black_box inputs if they expect the optimizer won't be able to optimize that input away in real usage and let Criterion.rs black_box outputs. Is that advice correct? I have no idea, and I suppose in some sense it doesn't matter. Benchmarks are always an imperfect model of the real system, so some bias is to be expected. I suspect many benchmark authors have measurements which are more biased than necessary due to incorrect usage of black_box though.

Saying that benchmark frameworks should integrate the black boxes and the benchmark author shouldn't have to think about it is nice, but things are not always so simple. Criterion.rs already automatically black_boxes the input and output to the benchmark routine, but it's not out of the question for a benchmark to want to black_box some inputs (e.g. ones that they know will be generated from user input in the real system) and not others (e.g. values that are expected to be constants in the real system). Furthermore, Criterion.rs can only black_box the one parameter that the user has passed to it; if the system under measurement takes multiple parameters, the user must explicitly black_box those (though it could be argued this is a design flaw in Criterion.rs, I suppose). At least some benchmark authors will have to be aware of these issues.

I can't see a strong reason to prefer black_box over bench_input/output or vice versa: bench_input/output gives a clearer understanding of what they do but increases the API surface area and thus the potential for confusion. I would prefer to avoid bench_used/obscure since they're very non-intuitive names.

I am surprised by the suggestion that bench_output should not run destructors. That means that I always have to call it like so:

```rust
let value: T = ...; // where T could potentially be Drop
bench_output(&value);
drop(value); // Or just let the compiler do this
```

I can't think of any reason to ever call it in any other way; passing an owned value into bench_output will potentially leak resources, and its name does not indicate that it will do so (unlike mem::forget). That seems like an unnecessary footgun. It could be changed to always take an &T to prevent misuse, or it could be equivalent to mem::drop instead. I can't speak to how this would affect the deoptimization behavior though.

bench_input is described as an unknown pure function. I'm not sure I understand the nuances of why it should be pure, rather than just being an unknown function with unknown side effects. Could you elaborate on that?

gnzlbg (Contributor, Author) commented Feb 19, 2019

> I am surprised by the suggestion that bench_output should not run destructors. That means that I always have to call it like so:

Oh I think I introduced that mistake. @eddyb's comment does say that it is equivalent to drop. AFAICT if one does not want the value to be dropped, they can just call bench_output(&v), so dropping seems like the natural thing to do.

> bench_input is described as an unknown pure function. I'm not sure I understand the nuances of why it should be pure, rather than just being an unknown function with unknown side effects.

Often, one just wants to inhibit optimizations that are based on the value that a variable has, and bench_input does that. If it had unknown side-effects, it could potentially read/write all memory it can access, so the compiler would need to be much more pessimistic about that, reloading globals after calling it, etc.

RalfJung (Member) commented Feb 20, 2019

> But, since stabilizing asm! is also a goal, I'd be interested in talking more about it somewhere else. :)

Oh, absolutely. :) I hope to have a look at how CompCert handles syscalls, which could give us some nice ideas about how to formalize "unknown function calls".

> Note that black_box, bench_used, and bench_obscure are all safe Rust functions, so miri needs to check proper argument initialization for all of them.

Note that Miri's behavior is entirely unaffected by whether a function is unsafe or not.

> To prevent that from happening, @eddyb proposes to add bench_output(x), which has the same operational semantics as mem::forget: it leaks its value, runs no destructors, but which the optimizer is encouraged to treat as an unknown function with read-only side-effects that depend on x.

To avoid the drop discussion, maybe the signature should be fn bench_output<T>(x: &T). That makes more sense for a read-only function anyway.

> Criterion.rs already automatically black_boxes the input and output to the benchmark routine, but it's not out of the question for a benchmark to want to black_box some inputs (e.g. ones that they know will be generated from user input in the real system) and not others (e.g. values that are expected to be constants in the real system).

What does Criterion's API for this look like? I've been imagining something like

```rust
fn iter<I: Clone, O>(&mut self, i: &I, f: impl Fn(I) -> O) {
  for /* a number of iterations */ {
    let i = bench_input(i.clone());
    let o = self.measure_the_time(|| f(i));
    bench_output(&o);
  }
}
```

Then it is up to the user to decide whether they capture inputs in the closure or pass them in explicitly via i -- and the part in i will be "black_box-ed". The type I can be a tuple if multiple parameters need to be "obscured".

So, from a user's perspective:

```rust
let input1 = prepare_input1();
let input2 = prepare_input2();
let input3 = prepare_input3();
bencher.iter(&(input1, input2), |(input1, input2)| {
  // now input1 and input2 have been passed through the black_box,
  // but input3 has not.
  run(input1, input2, input3)
});
```

(This also nicely keeps the clone cost out of the measurement, which is something that has bothered me about the default benchmarking API.)

bheisler commented Feb 20, 2019

> What does Criterion's API for this look like? I've been imagining something like

```rust
fn iter<I: Clone, O>(&mut self, i: &I, f: impl Fn(I) -> O) {
  for /* a number of iterations */ {
    let i = bench_input(i.clone());
    let o = self.measure_the_time(|| f(i));
    bench_output(&o);
  }
}
```

> Then it is up to the user to decide whether they capture inputs in the closure or pass them in explicitly via i -- and the part in i will be "black_box-ed". The type I can be a tuple if multiple parameters need to be "obscured".

That's more-or-less correct. There's some additional complexity to reduce the overhead and measurement error from calling Instant::now many times (among other things) but at the core it's a loop very much like that.

At a higher level, the user interacts with it like this:

```rust
let input_2 = unimplemented!();
c.bench(
    "GroupName",
    ParameterizedBenchmark::new(
        "function_1",
        |b, input_1| b.iter(|| function_1(*input_1, input_2)),
        vec![20u64, 21u64],
    )
    .with_function("function_2", |b, input_1| b.iter(|| function_2(*input_1, input_2))),
);
```

I should note that there are many things I don't like about this API, though, so I've been thinking about changing it.

In this case, the input_1 would be passed through criterion::black_box automatically, as would the output, but not input_2. criterion::black_box simply delegates to test::black_box if it's available (if a cargo feature is enabled) or uses a volatile read of the input as a fake black_box if it isn't.
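
(The volatile-read fallback mentioned here has roughly the following shape; this is the general idea rather than Criterion.rs's exact source.)

```rust
// A black_box stand-in built from a volatile read: the compiler must assume
// the volatile read observes the value, so it cannot optimize the input
// away, and the value is returned so callers can keep using it.
pub fn black_box<T>(dummy: T) -> T {
    unsafe {
        let ret = std::ptr::read_volatile(&dummy);
        std::mem::forget(dummy);
        ret
    }
}
```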

gnzlbg (Contributor, Author) commented Feb 22, 2019

So with a lot of help from @Amanieu (thanks!!), this is how the example from the RFC looks with bench_input and bench_output (godbolt diff):

```rust
fn push_cap(v: &mut Vec<i32>) {
    let v: &mut _ = bench_input(v);
    for i in 0..4 {
      bench_output(v.as_ptr());
      v.push(bench_input(i));
      bench_output(v.as_ptr());
    }
}
```

vs before:

```rust
fn push_cap(v: &mut Vec<i32>) {
    let v: &mut _ = black_box(v);
    for i in 0..4 {
      black_box(v.as_ptr());
      v.push(black_box(i));
      black_box(v.as_ptr());
    }
}
```

Implementation-wise, we can express @eddyb's semantics using inline assembly just fine. From the POV of codegen, the difference is small for this particular example, but one does observe slightly better codegen for @eddyb's version, which might be caused by the finer-grained semantics of input and output.

Implementing black_box on top of bench_input and bench_output does not produce the same results as the black_box version proposed by the RFC (godbolt diff) or the one currently implemented in Rust (godbolt diff), but the semantics are different, and the aim of this example was to explain how to use the feature and show large codegen differences (all code removed), not differences between two slightly different versions of the same thing.

RalfJung (Member) commented Feb 22, 2019

@gnzlbg that's strange, why do you bench_output the input?

gnzlbg (Contributor, Author) commented Feb 22, 2019

> why do you bench_output the input?

The vector is both the input and the output here.

bench_input:

```rust
#[inline(always)] // #[readnone]
fn bench_input<T>(x: T) -> T {
  // Reads the memory in x and modifies it
  unsafe { asm!("" : "+*m"(&x) : : :); }
  x
}
```

has no side-effects and reads/writes the memory of x, but it does not read/write through pointers that x might contain (at least not in this implementation; I don't know exactly whether this was @eddyb's intent), so it does not force pending writes to the vector's memory to be flushed, because it does not touch that memory.

bench_output:

```rust
#[inline(always)]
fn bench_output<T>(x: &T) -> () {
  // this reads the operand, reads all memory, and has a side-effect
  unsafe { asm!("" : : "*m"(x) : "memory" : "volatile"); }
}
```

has side-effects that depend on the memory of x and all memory reachable from x. We can't really express that in LLVM, so the implementation currently clobbers all memory (basically it reads/writes all memory that could be written to, i.e., everything except private locals).

If the user wants to measure the time it takes to push elements, including the time it takes to write the elements to the vector's memory, they need to force those writes to memory to happen. bench_input, at least as implemented, does not achieve this (it has no side-effects, it does not read through the vector's pointers, it is readnone). So one needs to insert a bench_output somewhere here.


We could say that bench_input reads through the vector's pointers, and that this influences its result. But then its specification would be readonly and not readnone. The only way I know of implementing this would be like this:

```rust
#[inline(always)]
fn bench_input<T>(x: T) -> T {
  // Reads the memory in x and modifies it
  unsafe { asm!("" : "+*m"(&x) : : "memory" :); }
  x
}
```

which means that in practice it wouldn't be readonly, because it writes to all memory (as mentioned, I don't know how to write a read-only memory barrier in LLVM; we could make it readonly with something like this).

On paper, that would mean that bench_input creates a new value from an existing one, and from all memory, while bench_output reads the value, reads all memory, and has some other side-effect (otherwise it can be removed).

In practice, both bench_input and bench_output would read the value, and read/write to all memory, so both would have side-effects.

Given the implementation constraints, in practice, if we pursue this approach, we'd end up with two ways of spelling the black_box proposed in this RFC. While users should use them appropriately, there is no reward for using them correctly, and technically we are not guaranteeing anything either; this is all just a quality-of-implementation / best-effort issue.

So AFAICT for bench_input and bench_output to add practical value over black_box, bench_input must be readnone (no side-effects, read only the memory of x and nothing outside that). This means that one might have to use bench_output to flush writes to memory every now and then. If we remove that constraint, we re-invent black_box, and then there is no need to have two functions.

RalfJung (Member) commented Feb 22, 2019

> The vector is both the input and the output here.

Sure, but you are making it the output on the input side. From an API perspective, that does not make any sense.

> bench_input must be readnone (no side-effects, read only the memory of x and nothing outside that).

Again from an API perspective, I find that weird. I'd think that a Vec passed to bench_input should be read in its entirety, including the transitively reachable part. It can still be readonly because it must not write to anything.

gnzlbg (Contributor, Author) commented Feb 22, 2019

> Again from an API perspective, I find that weird. I'd think that a Vec passed to bench_input should be read in its entirety, including the transitively reachable part. It can still be readonly because it must not write to anything.

We can definitely specify this, I just don't know how to implement this.

RalfJung (Member) commented Feb 22, 2019

> We can definitely specify this, I just don't know how to implement this.

Hm... I know nothing about these asm annotations, so this might make no sense, but is it possible to have two asm blocks, one that may read any memory transitively reachable from x and then write to some scratch space (just a stack slot we add for that purpose), and then the 2nd asm block may read+write just the two stack slots (x and the scratch space)? Shouldn't that make it readonly since the only thing the blocks can write to is stack slots?

comex commented Feb 23, 2019

From LLVM's perspective, inline assembly invocations are actually call instructions, and thus have an associated FunctionType. So you can literally attach readnone and readonly to them. Clang, for its part, does so in some cases:

```cpp
  // Attach readnone and readonly attributes.
  if (!HasSideEffect) {
    if (ReadNone)
      Result->addAttribute(llvm::AttributeList::FunctionIndex,
                           llvm::Attribute::ReadNone);
    else if (ReadOnly)
      Result->addAttribute(llvm::AttributeList::FunctionIndex,
                           llvm::Attribute::ReadOnly);
  }
```

ReadNone and ReadOnly start as true. The presence of the memory clobber or any memory outputs causes both to be set to false; memory inputs cause only ReadNone to be set to false.

The attributes seem to be the only mechanism by which LLVM determines the possible side effects of asm calls. In the case of ~{memory}, even though it is passed through to the LLVM IR and has no purpose other than affecting this determination, it seems to be completely ignored by the backend.

Currently, rustc never attaches readnone or readonly, and as a result, all Rust asm! statements are treated as if they could clobber memory, regardless of constraints. (playground link to test case)

This should be fixed to match Clang. In that case, I think asm!("" : : "*m"(x)) would suffice to imply readonly but not readnone.

...That said, I'm not convinced that such a subtle distinction is really worth providing a separate API for. In most cases it won't make a difference, and when it does, i.e. when optimizations are performed due to the use of bench_input instead of bench_output, I suspect they will more often than not contradict the user's intent.

eddyb (Member) commented Mar 12, 2019

@comex Keep in mind this "subtle distinction" also makes these functions very benchmark-centric and easily disavowable as "not for working around UB", which was my initial intention.

Also the different signatures should prevent most misuse IMO.
