
I need to do an oob vector load. How? #2

Open
brson opened this issue Jul 5, 2018 · 55 comments
Labels
A-memory Topic: Related to memory accesses S-pending-design Status: Resolving this issue requires addressing some open design questions

Comments

@brson commented Jul 5, 2018

As an optimization during a buffer search, I need (very want) to load that buffer into a SIMD vector, even when the buffer doesn't fit into the vector. E.g. I might have a 31-byte buffer that can be efficiently searched with a 32-byte wide AVX2 vector.

From a machine perspective, I don't see this as a problem, as long as the load doesn't extend beyond the current page; from LLVM's perspective this seems like UB.

I'd really like to be able to write this code in Rust and not have to use assembly.

Here's an example of this pattern:

    // Note: `likely`, `off`, and `do_tail_simple` are helpers defined
    // elsewhere in the original code.
    #[inline(always)]
    unsafe fn do_tail_clever(needle: u8, p: *const u8, len: isize,
                             i: isize, q: __m256i) -> Option<usize> {
        let rem = len - i;
        debug_assert!(rem < 32);

        // Check if the 32-byte load is within the current page
        let page_alignment = 4096;
        let page_mask = !(page_alignment - 1);
        let current_p = p.offset(i) as usize;
        let avx_read_end = current_p + 32;
        let next_page = (current_p & page_mask) + page_alignment;

        if likely(avx_read_end <= next_page) {
            let x = _mm256_loadu_si256(p.offset(i) as *const __m256i);
            let r = _mm256_cmpeq_epi8(x, q);
            let z = _mm256_movemask_epi8(r);
            let garbage_mask = {
                let ones = u32::max_value();
                let mask = ones << rem;
                let mask = !mask;
                mask as i32
            };
            let z = z & garbage_mask;
            if z != 0 {
                return off(i, z);
            }

            return None;
        }

        // Slow path
        do_tail_simple(needle, p, len, i, q)
    }

It loads beyond the array, does vector operations on it, then disregards the oob bytes with a mask.
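
The two pieces of arithmetic in that snippet can be checked in isolation with plain integers (hypothetical helper names, mirroring the code above):

```rust
/// Does a `width`-byte load starting at `addr` cross a boundary between
/// `page`-sized pages? (`page` must be a power of two.)
fn crosses_page(addr: usize, width: usize, page: usize) -> bool {
    let mask = !(page - 1);
    (addr & mask) != ((addr + width - 1) & mask)
}

/// Mask keeping only the low `rem` bits of a 32-lane movemask (rem < 32),
/// discarding the comparison results for the out-of-bounds lanes.
fn garbage_mask(rem: u32) -> i32 {
    !(u32::MAX << rem) as i32
}

fn main() {
    // A 32-byte load at 4064 ends exactly at the 4096 boundary: stays in-page.
    assert!(!crosses_page(4064, 32, 4096));
    // A 32-byte load at 4090 spills onto the next page.
    assert!(crosses_page(4090, 32, 4096));
    // With rem = 31 valid bytes, only the top movemask bit is discarded.
    assert_eq!(garbage_mask(31), 0x7FFF_FFFF);
}
```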

I'm hopeful that there is some mechanism to tell LLVM to 'forget' what it knows about this pointer, 'fooling' the optimizer into not messing with it.

From the LLVM aliasing rules, there is some language that makes me hopeful:

An integer constant other than zero or a pointer value returned from a function not defined within LLVM may be associated with address ranges allocated through mechanisms other than those provided by LLVM. Such ranges shall not overlap with any ranges of addresses allocated by mechanisms provided by LLVM.

So there is a class of pointers that can operate on arbitrary memory (those that don't come from LLVM). That suggests I could, e.g., send my pointer through assembly or some other black-box function to 'clean' it. On the other hand, calling into any function, or even into inline asm, imposes extra instructions that more-or-less defeat the optimization (inline asm in LLVM seems to always spill registers). And that sentence also says "such ranges shall not overlap with any ranges of addresses allocated by mechanisms provided by LLVM".

I'm not sure how much 'wiggle-room' there is. Is a malloc'd array "provided by LLVM"? What are the consequences of disobeying this "shall not"?

Even if there's no in-language solution and it is technically UB, I am hopeful that I can do this thing without LLVM messing with my codegen.

cc @nikomatsakis writing this here per your request.

@brson brson changed the title I need to do an oob vector read. How? I need to do an oob vector load. How? Jul 5, 2018
@brson commented Jul 5, 2018

One thing I could do here is track the capacity of the original vector, and only do the oob load if there's enough capacity. That would definitely reduce how often this could hit the fast path, but not sure how much.

Edit: NVM, this routine never sees the Vec capacity - it operates only on slices.

@RalfJung commented Jul 8, 2018

A very related question recently came up on Stack Overflow: someone was suggesting to read a full u32 through a u8 pointer, making sure the read never crosses page boundaries so there can't be a segfault.

As already discussed there, I think there are two related but distinct problems before we can even start taking Rust's own rules into account: you are potentially performing accesses outside of any allocation (as you already mentioned), and if not, you may be racing with other accesses to the bytes outside of your buffer.

For the out-of-bounds part, that is pretty much entirely in LLVM's hands. Rustc/MIR is not doing anything interesting there, but LLVM certainly does (for example, when you are accessing some pointer x + 3 and you have another pointer that LLVM knows points into an object of size <= 2, it will assume these accesses do not alias). You'd have to find a way to work around that, preferably something sanctioned by LLVM. That's probably something that would require discussion on the llvm-dev list. (I am sure the need for this comes up in C as well.)

For data races, Rust officially is using the C11 memory model. Read-write races are immediate UB under that model. So, if the extra byte you are accessing is actually allocated and currently accessed by some other thread, you would introduce UB. However, LLVM says that such read-write races yield undef/poison (effectively: uninitialized bytes) instead of raising UB. If Rust decided to switch from C11's model to LLVM's, that would enable your use-case if you carefully decorate everything with MaybeUninit to inform Rust that there may be uninitialized data around here.
The trouble is that C11's model is much better studied and much more clearly defined by now.
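
A minimal sketch of that decoration pattern (the helper name is invented for illustration; this only shows how `MaybeUninit` separates "bytes we may look at" from "bytes we must not", not how the OOB load itself would be sanctioned):

```rust
use std::mem::MaybeUninit;

/// Hypothetical helper: copy out only the first `len` bytes, which the
/// caller asserts are initialized; the tail is never inspected.
fn init_prefix(buf: &[MaybeUninit<u8>], len: usize) -> Vec<u8> {
    buf[..len].iter().map(|b| unsafe { b.assume_init() }).collect()
}

fn main() {
    let mut buf = [MaybeUninit::<u8>::uninit(); 4];
    buf[0].write(10);
    buf[1].write(20);
    // Only the initialized prefix is ever turned into real `u8` values.
    assert_eq!(init_prefix(&buf, 2), vec![10, 20]);
}
```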

Only if we solve those two points, our own (Rust-level) aliasing rules even become relevant. I could imagine us following LLVM's lead and making "bad" loads return undef instead of raising UB.

@brson commented Jul 8, 2018

@RalfJung when you say "you may be racing with other accesses to the bytes outside of your buffer" — what is the practical impact of that? In what concurrent/atomic scenario will my loads change the outcome for another thread? E.g., making atomic values visible before they should be?

@RalfJung commented Jul 8, 2018

The practical impact is hard to determine. Compilers are allowed to and will perform optimizations that are only valid if non-atomic accesses never have a data race. Let me try to construct an example for how they might break when combining an otherwise correct unsafely implemented library with your code.

For example, in the following C code

    int x = *x_ptr;
    acquire_lock(l);
    int y = *x_ptr;

gcc may and sometimes will replace the last line by int y = x;, which is correct because it knows there cannot be a concurrent write that could change the value behind x_ptr in the mean time. Now imagine a situation where a 32-byte (aligned) buffer (&mut [u8; 32]) is split into a 31-byte buffer (part1: &mut [u8; 31]) and a location (part2: &mut u8) that is put under the control of some unsafely implemented library lib. That library makes the location accessible from multiple threads and uses a lock stored somewhere else to synchronize (like a Mutex but with the data not stored in-band with the lock).

Now we have something like

    let h = lib::put_under_library_control(part2);
    something_that_uses_tail_clever(part1);
    let val = h.get();

If everything gets inlined, this matches the C code above: tail_clever will read part2 but throw away the result, then h.get() will acquire a lock and read part2 again. The compiler may optimize this to use the result of the first read, assuming there are no data races -- and we got a miscompilation.
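
For illustration, here is a single-threaded Rust sketch of the same shape (hypothetical names; it compiles and runs, but the miscompilation scenario above of course only arises under actual concurrency):

```rust
use std::sync::Mutex;

// Two reads of the same location straddling a lock acquisition, as in the
// C snippet above. If the compiler may assume there are no data races, it
// is allowed to replace the second read with the result of the first.
unsafe fn double_read(x_ptr: *const i32, l: &Mutex<()>) -> (i32, i32) {
    let x = *x_ptr;                 // first read, before taking the lock
    let _guard = l.lock().unwrap(); // acquire_lock(l)
    let y = *x_ptr;                 // candidate for folding into `y = x`
    (x, y)
}

fn main() {
    let v = 42;
    let l = Mutex::new(());
    assert_eq!(unsafe { double_read(&v, &l) }, (42, 42));
}
```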

Now, this is clearly a very contrived example. But the point is, we cannot just ignore UB due to data races. The only thing we can do is pick different rules and make sure the compiler follows those rules -- LLVM will not perform the optimization outlined above precisely because under LLVM semantics, this read-write race is not UB.


Coming back to the higher level, I think this is an excellent example of why one may prefer the LLVM memory model over the C11 one. Seqlocks are another example that causes trouble with the C11 memory model and AFAIK works fine with the LLVM model (though I have not seen an analysis of the latter).
There may be other arguments for the C11 model; e.g., I do not know the situation around DRF theorems (data-race-freedom theorems) for the LLVM model. The C11 model has some pretty strong DRF theorems, saying e.g. that a program that is race-free under sequentially consistent semantics and only uses non-atomic and sequentially consistent accesses does not gain any additional behaviors when considering the full C11 semantics. These theorems ensure that programs not using the weaker access modes do not have to care. I haven't seen such theorems for the LLVM model, but that's just because I haven't seen that model studied very much at all.

@brson commented Jul 8, 2018 via email

@RalfJung commented Jul 9, 2018

Yeah, unfortunately these inlining hints don't actually change the program semantics -- they affect what the compiler will do, but not what it could do. From a correctness stand-point, I do not know of a way to make inlining hints "mean" anything.

@avadacatavra avadacatavra added the A-memory Topic: Related to memory accesses label Aug 24, 2018
@gnzlbg commented Nov 27, 2018

@Amanieu would like to be able to do oob atomic loads as well: rust-lang/rust#32976 (comment)

That's required for correctness, and is not an optimization AFAICT.

@Amanieu commented Nov 27, 2018

@RalfJung

@Amanieu what is that doing?

@Amanieu commented Nov 27, 2018

It is emulating 8/16/32-bit atomic operations on older ARM architectures (without native atomic support) using a kernel-provided 32-bit cmpxchg function.
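
To see why such emulation involves a wider-than-requested access, here is a hypothetical, simplified (non-atomic) sketch of loading one byte by reading the aligned 32-bit word that contains it; a little-endian target is assumed. When the byte sits at the very end of an allocation, the containing word extends past it, which is exactly the OOB concern:

```rust
// Hypothetical illustration, not the actual kernel-helper code: load a
// byte by volatile-reading the aligned 32-bit word containing it.
// Assumes a little-endian target.
fn load_u8_via_u32(p: *const u8) -> u8 {
    let addr = p as usize;
    let word = unsafe { core::ptr::read_volatile((addr & !3) as *const u32) };
    let shift = (addr & 3) * 8; // byte position within the word (LE)
    (word >> shift) as u8
}

fn main() {
    let w: u32 = 0x0403_0201; // four bytes, guaranteed 4-byte aligned
    let base = &w as *const u32 as *const u8;
    for i in 0..4 {
        // Every byte of `w` is in-bounds here, so this use is valid.
        assert_eq!(load_u8_via_u32(unsafe { base.add(i) }), (i + 1) as u8);
    }
}
```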

@RalfJung

Does LLVM know that these are 32-bit memory accesses? Code in other translation units, compiled from a different language with different UB (and linked on the assembly level, i.e., in a language where this is not UB), does not have to follow the same rules. Syscalls are an extreme case of "different translation unit".

Only LLVM IR itself is subject to LLVM IR's rules. (Of course there must be some amount of interop, and a shared memory model, but that seems plausible in this case.)

@gnzlbg commented Dec 13, 2018

Talking about doing SIMD loads OOB:

if you carefully decorate everything with MaybeUninit to inform Rust that there may be uninitialized data around here.

@RalfJung that might be doable, but @brson would need to heavily rewrite their code. Here:

    let p: *const __m256i = /* ptr to allocation smaller than 32 bytes */;
    let x: __m256i = _mm256_loadu_si256(p);

The problem is that core::arch::x86_64::_mm256_loadu_si256 returns an __m256i - not a MaybeUninit<__m256i> (which wouldn't help much), nor a Simd<[MaybeUninit<i64>; 4]>.

If packed_simd supported Simd<[MaybeUninit<i64>; 4]>, one could maybe write:

    let p: *const Simd<[MaybeUninit<i64>; 4]> = /* ptr to allocation smaller than 32 bytes */;
    let x: Simd<[MaybeUninit<i64>; 4]> = ptr::read_unaligned(p);
    // ^^ Is ptr::read_unaligned the right tool for reading memory OOB?

where Simd<[MaybeUninit<i64>; 4]> would support the same API as Simd<[i64 ;4]> (comparisons, arithmetic, bit manipulation, reductions, etc.) but propagating undef.


Implementation-wise, I don't really know how that would work. Adding the API to packed_simd is "trivial", but what LLVM IR should it generate? LLVM vectors are of type <N x T>, but I don't know whether we can put <4 x MaybeUninit_i64> there, and even if we could, whether LLVM could do something meaningful with it. Maybe an attribute, <4 x maybe_undef i64>?

@RalfJung

@gnzlbg Notice that this only "solves" the data-race part. One alternative (not correct in theory, but experimentally confirmed to work in practice) is to use volatile reads for the non-atomic maybe-racy reads. LLVM hasn't sanctioned this, and maybe we should have a discussion with them about it. Another alternative might be to use LLVM's monotonic accesses; I am not sure if anybody has experimented with them in Rust yet.

None of this helps with the fact that the accesses are OOB. There is no solution to that other than having explicit support for this from LLVM.

@RalfJung RalfJung added the C-open-question Category: An open question that we should revisit label Aug 14, 2019
gnzlbg added a commit that referenced this issue Aug 27, 2019
rearrange a bit and be more explicit about how our rules interact
@RalfJung commented Nov 7, 2020

@Amanieu and @thomcc recently had a related discussion on Zulip. It seems the general preference is to permit this for volatile accesses, assuming we can get LLVM to sanction that.

My concern with this is that volatile will inhibit optimizations, which seems in opposition to the goal stated in the OP -- to use a vectorized loop for performance. So it might be that only giving volatile accesses "OOB powers" is not enough, we might also need some (opt-in) way to do this for regular accesses.

@chorman0773

I'd be concerned with allowing any kind of OOB access (or OOB pointer arithmetic; note that wrapping_add would be implemented as integer arithmetic). The second it can cross into an unreachable object, either you definitely do have undefined behaviour, or way too many optimizations go out the window. There could also be concerns about allowing OOB access, period, as padding could theoretically be manipulated to store internal compiler state when there isn't a chance of it getting overwritten.

@RalfJung commented Nov 7, 2020

note that wrapping_add would be implemented as integer arithmetic

FWIW, it currently is not -- it is implemented as getelementptr without "inbounds". Speaking in terms of LLVM semantics, this preserves provenance, which integer arithmetic will not (assuming LLVM wants to support the usual arithmetic identities).

I am not sure what you mean by "would".

The second it can cross into an unreachable object, either you definitely do have undefined behaviour, or way too many optimizations go out the window.

This is exactly why we use getelementptr for wrapping_add: it cannot cross allocation boundaries.

There could also be concerns about allowing OOB Access period, as padding could be theoretically manipulated to store internal compiler state when there isn't a chance of it getting overwritten.

AFAIK we are only talking about reads here. I do not know of a reasonable way to permit OOB writes.

OOB reads would return "uninit" for the OOB part, even if that happens to be in-bounds for another object. This should hopefully suffice to preserve optimizations.

@chorman0773 commented Nov 7, 2020

I am not sure what you mean by "would".

In this case, I am referring to the lccc model, in which pointer arithmetic comes straight out of the C and C++ Standards.

it cannot cross allocation boundaries.

wrapping_add (or actually it might be called wrapping_offset) can. offset cannot cross allocation boundaries; that's UB. Integer arithmetic from a pointer value on lccc does preserve provenance as long as the equivalent operation applied to the pointer value would have defined behaviour (that is, given p: *mut T, (p as usize + 4*size_of::<T>()) as *mut T would be the same value as p.offset(4), if that expression has defined behaviour; otherwise the result is an invalid pointer).

OOB reads would return "uninit" for the OOB part, even if that happens to be in-bounds for another object.

Returning uninit from OOB may be fine. However, as I have mentioned in #76, for scalar values, uninit in lccc is poisoning (if one byte of a scalar object is uninit, the entire value is uninit). This shouldn't cause issues, at least in lccc, provided the read-from type doesn't have any validity requirements. Note that for volatile, this is less of an issue, as volatile accesses are always freezing in lccc (which prevents the poisoning of the entire value, since that occurs on reads and writes).

@RalfJung commented Nov 7, 2020

wrapping_add (or actually it might be called wrapping_offset) can.

No it cannot. Quoting from the docs:

In particular, the resulting pointer remains attached to the same allocated object that self points to. It may not be used to access a different allocated object. Note that in Rust, every (stack-allocated) variable is considered a separate allocated object.

@chorman0773

What I mean is that it's valid to use wrapping_add to exceed the allocation, you just can't access outside it. add cannot yield a pointer outside the allocation, full stop. wrapping_add can, but the result cannot be dereferenced (which is why I noted that the implementation of p.wrapping_add(4) would return the value of p.add(4) if the latter is defined, and otherwise an invalid pointer; invalid pointers are UB to do much of anything with, or claim much of anything about).

@RalfJung commented Nov 7, 2020

What I mean is that it's valid to use wrapping_add to exceed the allocation, you just can't access outside it. add cannot yield a pointer outside the allocation, full stop. wrapping_add can, but the result cannot be dereferenced

Ah, that is a terminology difference then. I would say that the pointer you get from wrapping_add never enters another allocation -- its provenance still stays attached with the original allocation. So, it can never "point to another allocation", even if its integer address is inside another allocation.

the implementation of p.wrapping_add(4) would return the value of p.add(4) if it the latter is defined, otherwise an invalid pointer.

That would not be a correct implementation. p.wrapping_add(400).wrapping_sub(400) returns the original pointer, even if the intermediate pointer is out-of-bounds.
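
This round trip is directly observable (a minimal sketch of the guarantee being described):

```rust
fn main() {
    let x = 5u8;
    let p = &x as *const u8;
    // The intermediate pointer is far out of bounds of `x`'s allocation,
    // but it is never dereferenced, which `wrapping_add` permits.
    let q = p.wrapping_add(400).wrapping_sub(400);
    assert_eq!(q, p);
    // Provenance is preserved across the round trip, so the result may
    // still be used to access `x`.
    assert_eq!(unsafe { *q }, 5);
}
```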

@chorman0773

That would not be a correct implementation. p.wrapping_add(400).wrapping_sub(400) returns the original pointer, even if the intermediate pointer is out-of-bounds.

lccc also has a reverse round-trip rule to complement the round-trip rule (this also might be required by C++, idk), which says that if x is an appropriately sized integer of type U, (x as *mut T).add(n) as U has the value x + n*size_of::<T>() (provided the former has defined behaviour). These two rules combined make that situation well-defined and correct, even though the intermediate pointer is invalid (note: an invalid pointer is not the same as an invalid value, or particularly "the" invalid value). Together they reduce that entire operation to just p.
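
The address identity behind that rule (in-bounds `add` advances the address by `n * size_of::<T>()`) can at least be observed for ordinary Rust pointers (a sketch; the integer-typed formulation above is paraphrased here):

```rust
use core::mem::size_of;

fn main() {
    let arr = [0u32; 8];
    let p = arr.as_ptr();
    let x = p as usize;
    // (x as *const u32).add(3) as usize == x + 3 * size_of::<u32>()
    assert_eq!(unsafe { p.add(3) } as usize, x + 3 * size_of::<u32>());
}
```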

@comex commented Nov 7, 2020

Returning uninit from OOB may be fine. However, as I have mentioned in #76, for scalar values, uninit in lccc is poisoning (if one byte of a scalar object is uninit, the entire value is uninit). This shouldn't cause issues, at least in lccc, provided the read from type doesn't have any validity requirements.

The OP's use case doesn't just need the load to be non-UB, it needs the load to produce a value where the bits corresponding to in-bounds bytes are correct. So it seems like either you must track uninitializedness on a per-bit level (as LLVM does), or this must be a special kind of load which produces something different from normal uninitialized values.

@RalfJung commented Nov 7, 2020

you must track uninitializedness on a per-bit level (as LLVM does)

I don't think it does... at least, with the proposal to track this via poison, an iX is either fully poison or fully initialized.

However, an [iN x M] has per-element poison tracking.

@chorman0773

Usually, the compiler has to prove that a platform load correctly implements an Abstract Machine load; once you go OOB, that responsibility would be shifted to the programmer.

That seems to contradict the definition of a lang spec, which is that absent UB, the compiler has to correctly implement the spec. Saying "if the access would violate the semantics of the abstract machine, the behaviour is undefined", to me, either says the behaviour is undefined always (defeating the purpose) or never (making all out-of-bounds accesses well-defined, which is equally brilliant). A workaround may be to just say conditionally-supported and leave it to the implementation to decide when it is and is not supported. However, this wouldn't necessarily help portability, as an implementation could simply document that "out-of-bounds accesses are never supported" and we are back where we started (though if enough implementations say it works in a particular circumstance, it becomes a de facto standard).

@comex commented Nov 9, 2020

It's reasonable enough that some implementations may not be able to support out-of-bounds accesses at all, such as any implementation targeting CHERI (and ARM's implementation, Morello) which features byte-granularity memory protection. That could be considered equivalent to a page size of 1.

@RalfJung commented Nov 9, 2020

That seems to contradict the definition of a lang spec, which is that absent UB, the compiler has to correctly implement the spec. Saying "if the access would violate the semantics of the abstract machine, the behaviour is undefined", to me, either says the behaviour is undefined always (defeating the purpose) or never (making all out-of-bounds accesses well-defined, which is equally brilliant).

We have things like inline assembly and volatile accesses where parts of the target semantics "leak through" into the program. This would be similar.

The problem is then that a type with size N has 2^N different uninit values (the power set of the byte range; however, because we have an O(lg(N)) amount of space to store this value, this is only O(N) extra space overall, so this is more annoying than a problem). For context, the largest scalar type supported by lccc is _Complex unsigned long long __attribute__((mode(OI))) in C (thanks gcc, for both of those), which is 64 bytes.

I do not understand the concern here... the space of "abstract bytes" is already much bigger than 256 elements due to provenance, having one extra bit for tracking initialization really does not make much of a difference. It's not like the compiler has to actually manifest these bits; this is a spec-only concept / "ghost state".

@chorman0773

It's not like the compiler has to actually manifest these bits; this is a spec-only concept / "ghost state".

It would have to track them at compile time. Also, fair point about provenance. Pointers are annoying (I will admit that an arbitrary graph of derivations and limits is probably more complex than storing an additional 64-bit state for uninit).

We have things like inline assembly and volatile accesses where parts of the target semantics "leak through" into the program.

While I do not know off the top of my head about Rust for this, inline assembly (called an asm declaration) is conditionally-supported with implementation-defined semantics in C++, and it leaves it at that (which is really a brilliant definition for it, staying as far away as possible from what it actually does). Also, volatile accesses in C++ are entirely well-defined without talking about how the processor works; they must be evaluated strictly according to the rules of the abstract machine.

@RalfJung commented Nov 9, 2020

It would have to track them at compile time.

No, it would not. You can soundly approximate "partially initialized" as "fully initialized" if you have some analysis that does not need byte-level precision.

Inline Assembly (called an assembly declaration) is conditionally-supported with implementation-defined semantics in C++, and it leaves it at that (which is really a brilliant definition for it, staying as far away as possible from what it actually does)

IMO it's a horrible definition, as it answers basically none of the interesting questions around it. There's some discussion of the Rust semantics here; as you can see, saying "it's implementation-defined" is woefully inadequate as a specification.

Also, volatile accesses in C++ are entirely well-defined without talking about how the processor works, they must be evaluated strictly according to the rules of the abstract machine.

That definition, too, says basically nothing. Everything needs to be evaluated according to the rules of the abstract machine, that's what the as-if rule says. "Strictly" is meaningless here, or at least I have yet to see a proposal for turning it into a precise definition. There are some threads discussing volatile semantics in this issue tracker. Again, the C++ spec is woefully incomplete for answering key questions, such as the exact reasoning principles clients can use when reasoning about code performing volatile accesses, or the reasoning principles compiler authors need to use to evaluate whether an optimization is correct.

But I did not intend to discuss the semantics of volatile or inline assembly here, and we clearly have very different expectations when it comes to how precise a spec ought to be. I just said that those are existing mechanisms to let target-specific semantics "leak" into Rust programs. It takes a bit of work (much more than what C/C++ do) to make that precise, but it is possible. The same approach can then also be used to put precise bounds and contracts on OOB loads.

@chorman0773

The same approach can then also be used to put precise bounds and contracts on OOB loads.

I would love to see a precise definition of inline assembly that satisfies every possible variation. Similarly, I'd like to see a precise definition allowing OOB without constraining implementations to talk about explicit target semantics. The best I can think of is that it can either result in an indeterminate byte/value, or raise an implementation-defined signal (which may be more reasonable, the more I think about it).

@RalfJung

Yeah, I'd love to work that out in more detail, but I need to spell out the foundations for how I'd like to structure the Abstract Machine first.

The best I can think of is that it can either result in an indeterminate byte/value, or raise an implementation-defined signal (which may be more reasonable, the more I think about it).

I think we have to explicitly exclude the possibility of a signal being raised, because that is observable which would prevent reordering. That's why I said earlier

the programmer has to ensure that the OOB load has no further side-effects on the underlying platform

The signal would be one of these side-effects. IOW, as a programmer doing such a load, it is your obligation to ensure that the load does not cause any signal to be raised. This is how the requirement to not cross page boundaries arises, in a way that can be stated without mentioning pages.
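
As a sketch of how a programmer might discharge that obligation without mentioning pages directly (hypothetical helper name): an aligned 8-byte load can never cross a page boundary, so it cannot fault as long as the page containing the pointed-to byte is mapped.

```rust
/// Hypothetical sketch: volatile-load the aligned 8-byte word containing
/// `p`, returning the word and the offset of `p` within it. Because the
/// load is 8-byte aligned it cannot cross a page boundary, so it raises
/// no signal as long as the page containing `p` is mapped.
unsafe fn containing_word(p: *const u8) -> (u64, usize) {
    let aligned = (p as usize & !7) as *const u64;
    (core::ptr::read_volatile(aligned), p as usize & 7)
}

fn main() {
    let w: u64 = 0x0807_0605_0403_0201;
    let base = &w as *const u64 as *const u8;
    // Here everything is in-bounds, so the call is actually valid.
    let (word, off) = unsafe { containing_word(base.add(2)) };
    assert_eq!(word, w); // aligning down lands on the start of `w`
    assert_eq!(off, 2);
}
```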

@chorman0773 commented Nov 27, 2020

I actually had an idea about how this could be specified. Let me know if this sounds good (actual wording can be adjusted):
In core::ptr::volatile_read:

  • If this function would access any bytes outside of the provenance of ptr, the access is conditionally-supported. If supported, the value of those bytes is unspecified (and may be uninitialized), or an implementation-defined signal is raised. The behaviour is undefined if the resulting value is not valid for the type of the access.

For core::ptr::read:

  • If this function would access any bytes outside the provenance of ptr, the access is conditionally-supported. If supported, any such bytes read are uninitialized. If the corresponding call to core::ptr::volatile_read is unsupported, or would result in a signal being raised, the behaviour is undefined. The behaviour is undefined if the resulting value is not valid for the type of the access.
  • Note 1 - A particular call to core::ptr::read need not be supported, even if the equivalent call to core::ptr::volatile_read is, and would not cause a signal to be raised - End Note

@thomcc commented Nov 27, 2020

The behaviour is undefined if the resulting value is not valid for the type of the access

Just to clarify: this just means the resulting value must satisfy the validity requirements of its type, right? It isn't introducing any sort of typed memory?

@chorman0773 commented Nov 27, 2020 via email

@RalfJung

If this function would access any bytes outside of the provenance of ptr,

I don't think volatile should be allowed to bypass provenance rules such as Stacked Borrows. Allowing that would inhibit all the optimizations the aliasing rules are meant to enable. This should be strictly about "outside the bounds of an allocation", not "outside the bounds of what provenance says can be done".

How is "conditionally-supported" defined? Is this like "implementation-defined", in that implementations need to state the conditions under which it is supported? If so, what would be something an implementation could say to actually enable OOB accesses?

One main point of a spec is to enable programmers to reason that their code is correct, and I do not think your spec lets them do that. The spec needs to answer the question "as a programmer, what do I need to do to ensure that my program will behave correctly after compilation".

@chorman0773

(Reposted because reply by mail works flawlessly)

I don't think volatile should be allowed to bypass provenance rules such as Stacked Borrows. Allowing that would inhibit all the optimizations the aliasing rules are meant to enable.

Fair point, and it could be changed to talk about the same thing. However, isn't the upper bound of pointer provenance the allocation it points into? Additionally, "unspecified and may be uninit" is extraordinarily permissive (on the level of an indeterminate value in C, defined as an unspecified value or a trap representation). It wouldn't even have to represent any possible state the byte held when the read occurred, even with other (non-volatile) writes reordered, so this would seem to keep the optimizations intact, aside from reordering the read itself, which can't be done anyway (as it's volatile).

How is "conditionally-supported" defined?

The implementation chooses whether it is supported at all, and documents if and when it is not.

One main point of a spec is to enable programmers to reason that their code is correct, and I do not think your spec lets them do that.

In all cases, you'd need to look at the documentation for the particular compiler, and certainly never use any type that has a validity invariant stricter than u8 or MaybeUninit<u8> (depending on whether or not UCG allows uninit integers). Volatile is probably the easiest to reason about: it's well-defined (from a language perspective) provided it's supported and you don't violate the validity invariant. For non-volatile it's harder. Maybe we could have an implementation-defined "buffer zone" that you can read freely, knowing it won't raise a signal (and thus be UB for non-volatile). Implementation-defined is really the best that can be done, though.

@RalfJung
Copy link
Member

Additionally, unspecified and may be uninit is extraordinarily permissive (on the level of an indeterminate value in C, defined as an unspecified value or a trap representation). It wouldn't even have to represent any possible state the byte held when the read occurred, even with other (non-volatile) writes reordered, so this would seem to keep the optimizations intact, aside from reordering the read, which can't be done anyways (as it's volatile).

Good point. Since this is only about reads and it doesn't actually "leak" any information, it is hard to imagine this breaking any optimization.

I guess you are coming from the perspective that the bounds of an allocation are themselves just an expression of provenance? In my mental model, allocations fundamentally have a given size, and the gaps between allocations have no value associated with them at all (not even Uninit). So OOB errors are of a different nature than Stacked Borrows provenance errors. (This is also how things are implemented in Miri). It seems you are viewing OOB as "just another kind of provenance error", and I can see how that view is appealing. However, it is not obvious to me that the correctness proofs we have for provenance-based optimizations in Stacked Borrows will easily carry over to a semantics where provenance may be violated on reads but the read then yields Uninit. Intuitively this makes sense; doing the proof is a different game. ;)

In all cases, you'd need to look at the documentation for the particular compiler

So it seems like you just moved the hard work of specifying OOB loads such that the above code is allowed to the compiler. That's not solving the problem though. I don't think we are done here until we have a proposal for a spec that actually permits the kind of code the OP is asking for. So in terms of your proposal that would mean not only writing the relevant part of the Rust spec, but also writing the relevant part of the rustc docs that complete the spec to an actually concrete semantics, so that code authors can point to those docs and say "my code is correct because of what it says here".


Also, I noticed your proposal permits a signal to be raised. I don't think that's a good idea, since it makes things observable that really shouldn't be observable. As I said before: "I imagined some language where the programmer has to ensure that the OOB load has no further side-effects on the underlying platform. Usually, the compiler has to prove that a platform load correctly implements an Abstract Machine load; once you go OOB, that responsibility would be shifted to the programmer."

In other words, we basically require a proof from the programmer that a load instruction on the underlying hardware with the given size will correctly implement an Abstract Machine load. Or putting it differently, Behavior is Undefined unless a load instruction on the underlying hardware correctly implements an Abstract Machine load. "Correctly implement" unfortunately depends on the concrete simulation relation used by the implementation in question, but I think we can say for sure that it involves "no side-effects" and "always returns successfully", which rules out signals.

For example, on x86-64 we should be able to say that the load needs to be fully within a page such that there provably is a pointer that is dereferencable for size 1 pointing to the same page. (Optimizations might replace memory by registers, so there might not actually be any physical page, but then the OOB part also has no chance of triggering a signal so we should be good.) When doing the correctness argument for the compiler, this should be sufficient to prove that the load will always complete and never raise a signal. And when reasoning about our code as a programmer, this gives us enough information to actually say for sure that our code will be correct.

In fact, if there are no other conditions required to make such a load work, we could even make the page size implementation-defined and fix everything else. Implementations can still pick a page size of 1 to avoid making any promises. Then if there is a constant like core::mem::PAGE_SIZE, programmers can write code that will work with any implementation. ("Page size" might be a bad term for this as it doesn't have to match physical memory pages; suggestions welcome.)
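The same-page condition described above can be sketched in Rust. This is illustrative only: `PAGE_SIZE` here is an assumed 4096-byte value (x86-64's base page size) standing in for the proposed implementation-defined constant.

```rust
// Sketch only: PAGE_SIZE is an assumed stand-in for the proposed
// implementation-defined constant; 4096 matches x86-64's base page size.
const PAGE_SIZE: usize = 4096;

/// True if all `len` bytes starting at `addr` lie in the same page as
/// `addr`, i.e. a wide load at `addr` cannot cross a page boundary.
fn load_stays_in_page(addr: usize, len: usize) -> bool {
    // The offset of `addr` within its page, plus the load width, must
    // not spill past the end of the page.
    (addr % PAGE_SIZE) + len <= PAGE_SIZE
}
```

With this check, a 32-byte load that ends exactly at a page boundary is allowed, while one starting a byte later is rejected.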

@chorman0773
Copy link
Contributor

Also, I noticed your proposal permits a signal to be raised

Only for volatile reads, which are already observable. For non-volatile reads it's UB if the equivalent volatile read would raise a signal. This preserves the optimizations for reordering non-volatile accesses. I don't see how adding the option for volatile reads to trap within defined behaviour would inhibit too many optimizations, as volatile is very limited in how it can be optimized.

that responsibility would be shifted to the programmer

This would apply here: in order to validly perform a non-volatile read, you would have the responsibility of ensuring that the equivalent volatile read wouldn't trap. The minimum buffer width would provide some of that, by giving a sequence of bytes known to be correct.

I guess you are coming from the perspective that the bounds of an allocation are themselves just an expression of provenance

Kind of. In lccc, they are equivalent (or at least related) concepts, both expressed as the reachability of a pointer. The reachability of a pointer to an object is defined as the largest sequence of bytes that are part of the object representation of the largest object pointer-interconvertible with it, and the immediately enclosing array thereof (with some exclusions to permit unique and readonly optimizations). This, and the reachability of any pointer that can be validly created from it, would be the provenance of that pointer in Rust terms. So under this model, the bounds of the allocation provide an upper bound for the reachability, and thus the provenance.

but also writing the relevant part of the rustc docs that complete the spec to an actually concrete semantics

For rustc, would the following be good:

Volatile and non-volatile out-of-bounds accesses are supported, provided the pointer is into an allocation of at least one byte and is non-null. For volatile accesses, it is guaranteed not to raise a signal if the accessed byte is in the same page as any byte in that allocation; otherwise, it is not specified whether the access raises an asynchronous SIGSEGV. Pages are aligned to their size. The size of pages is platform-dependent, and is a power of 2.

And then provide examples of page sizes, like x86-64 has 4096 bytes in a page.

Or putting it differently, Behavior is Undefined unless a load instruction on the underlying hardware correctly implements an Abstract Machine load

I think fundamentally, this requires a lot more knowledge than this, and certainly a lot more than what mine would. The implementation is bound to emulate the observable behaviour of the abstract machine unless it contains UB. By shifting the burden of ensuring the evaluation does so onto the programmer, I'd argue you've created a circular case. The implementation is required to perform the access correctly if the implementation performs the access correctly. The underlying hardware is, after all, a part of the implementation. An implementation could "support" it, but then choose a mechanism for emulating the load that is never correct for OOB, and under this idea, that would be valid.

@RalfJung
Copy link
Member

RalfJung commented Dec 2, 2020

I think fundamentally, this requires a lot more knowledge than this, and certainly a lot more than what mine would. The implementation is bound to emulate the observable behaviour of the abstract machine unless it contains UB. By shifting the burden of ensuring the evaluation does so onto the programmer, I'd argue you've created a circular case.

I have done no such thing, I have just provided a way to "plug in" to what the compiler does so that the user can help the compiler complete its argument.

But that was anyway just the explanation for how to arrive at the proposal I made at the end of my post, which I think is fairly close to what you proposed for the rustc docs. However, by moving everything relevant into the rustc docs you made it impossible to do OOB accesses in Rust code that can be compiled with more than 1 compiler, hence my proposal to put something like a page size all the way into the spec.

@chorman0773
Copy link
Contributor

chorman0773 commented Dec 2, 2020

hence my proposal to put something like a page size all the way into the spec.

Doing that may work, but it may leave certain kinds of implementations off the table. And, even then, this has the same defect really, except giving a way to express this limit. Although, now that I put that in words, it is kind of growing on me. I'm wondering about a hybrid one, that combines the two. So perhaps I revise the specification as follows for core::ptr::read_volatile:

  • If any byte accessed is not within the provenance of the pointer, then the access is conditionally-supported. If the accessed byte is on the same page as any byte which can be accessed by the pointer, the result is unspecified (and may be uninitialized). Otherwise, the resulting byte is uninitialized or an implementation-defined signal is raised. If the resulting value is invalid for the type of the access, the behaviour is undefined. Pages are a sequence of contiguous bytes of an implementation-defined size, which are aligned to their size.

And then core::mem::PAGE_SIZE:

The value of type usize which is the implementation-defined size (and thus alignment) of pages or the value 0 to indicate that pages are not distinguished by the implementation. Shall be a power of 2, or the value 0.

Does the above sound good?
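As a sanity check of how a user might rely on this wording, here is a hedged sketch. The `PAGE_SIZE` constant is hypothetical (the proposed `core::mem::PAGE_SIZE` does not exist) and is assumed to be 4096 here; the 8-byte read width is arbitrary. It reads a `[u8; 8]` rather than a `u64` so the volatile read has no alignment requirement.

```rust
use core::ptr;

// Hypothetical stand-in for the proposed core::mem::PAGE_SIZE.
// 0 would mean "pages are not distinguished"; 4096 is assumed here.
const PAGE_SIZE: usize = 4096;

/// Volatile 8-byte read starting at `p`, which must point into some
/// allocation of at least one byte. Returns None when the proposed
/// wording would permit a signal (the read crosses a page boundary).
unsafe fn wide_tail_read(p: *const u8) -> Option<[u8; 8]> {
    let addr = p as usize;
    if PAGE_SIZE != 0 && (addr % PAGE_SIZE) + 8 > PAGE_SIZE {
        return None; // could cross into an unmapped page
    }
    // Under the proposal, any out-of-bounds bytes read here are
    // unspecified (and may be uninitialized), but no signal is raised.
    Some(ptr::read_volatile(p as *const [u8; 8]))
}
```

A caller would then handle the `None` case with a slower byte-by-byte tail loop.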

@RalfJung
Copy link
Member

RalfJung commented Dec 2, 2020

And, even then, this has the same defect really, except giving a way to express this limit.

Being able to query the limit from inside the code makes all the difference, IMO.

For your proposed read_volatile spec, two questions:

  • Why support the cross-page case at all? I think this has not been requested in this thread. So for now I'd prefer to limit this to same-page (i.e., no-signal) accesses. I'm in favor of trying to solve one problem at a time. :)
  • "the result is unspecified" -- I think it is important to say that the out-of-bounds bytes are unspecified. And since "they are uninitialized" is observably equivalent to "they may be uninitialized", I think we can just say that they are uninitialized.

For the PAGE_SIZE, why not use "1" as sentinel value for "pages are not distinguished"? Then this would always be a power of two.

@chorman0773
Copy link
Contributor

chorman0773 commented Dec 2, 2020

Why support the cross-page case at all

I don't think it's necessarily bad to say it cannot be supported, and saying it can raise a signal even if it is supported, I think is reasonable. The main issue I've heard from this thread against raising a signal is that it would inhibit some reordering optimizations, but volatile operations already do so, and are already observable behaviour. An implementation could also choose not to support cross-page access under the blanket conditionally-supported. It would simply have to document this choice.

I think it is important to say that the out-of-bounds bytes are unspecified

That is true; that wording can be fixed. I think it was in the original version, but got left out in the rewrite. As for why it's "unspecified (and may be uninitialized)": I think saying the implementation is allowed to produce a particular value is ok, and this matches the C definition of an indeterminate value ("an unspecified value or a trap representation", where uninitialized bytes are a trap representation). An implementation, for example, could freeze all volatile accesses; this wording indicates that such an implementation is valid.

For the PAGE_SIZE, why not use "1" as sentinel value for "pages are not distinguished"

By "pages are not distinguished" I mean a fictious implementation that doesn't have pages, so the volatile read could never trap (IE. the page size is 2^n where n is 8*size_of<*const T>()). A PAGE_SIZE value 1, in contrast, means that each individual byte is a different logical page, so the volatile access can always trap (it may not necessarily trap, but it can).

@chorman0773
Copy link
Contributor

I'd note that in the above case, the documentation for rustc would then be the page size (and thus the value of core::mem::PAGE_SIZE), as well as any signals the cross-page access can raise (or if it would never support cross-page access, that choice), which would probably be SIGSEGV (at least on unix-like operating systems).

@chorman0773
Copy link
Contributor

Now, of course, the real question is whether or not these rules can be implemented on an llvm backend.

@RalfJung
Copy link
Member

RalfJung commented Dec 6, 2020

I just think it is easier to solve these problems in isolation than trying to solve more problems at once.^^ That's why I'd prefer to keep cross-page accesses out of the discussion. shrug

@chorman0773
Copy link
Contributor

I just think it is easier to solve these problems in isolation than trying to solve more problems at once.

Possibly. In my opinion, the cross-page access problem isn't necessarily being solved directly; it's just being solved as a side-effect of solving the main problem, though I can see the opposite argument. In either case, the rule I proposed for read_volatile doesn't necessarily need the cross-page rule (going across pages could just be made into blanket UB). So that can be removed if we are completely adamant against solving the problem now (or if the proposed solution is deficient in some reasonable manner), and then what has been proposed can be used to direct future solutions if and when one is needed or desired. However, if it is a reasonable solution, I don't see why it can't be adopted now.

@comex
Copy link

comex commented Jan 10, 2023

(Two years later…)

This pattern came up as a concern in an LLVM discussion about changing uninitialized reads to return poison instead of undef:

https://discourse.llvm.org/t/rfc-load-instruction-uninitialized-memory-semantics/67481/4

@JakobDegen JakobDegen added S-pending-design Status: Resolving this issue requires addressing some open design questions and removed C-open-question Category: An open question that we should revisit labels May 23, 2023
@JakobDegen
Copy link
Contributor

Briefly discussed in backlog bonanza: This is still open. Rust does not support it today, but it seems plausible to have in the language at some point

@RalfJung
Copy link
Member

RalfJung commented Apr 5, 2024

We actually now have an intrinsic that can do something like this: simd_masked_load. However, you need to produce a mask that indicates which parts of the vector are in-bounds and which are not.
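For the original tail-search use case, the mask such an intrinsic wants just marks which lanes are still inside the buffer. A minimal scalar sketch (not the intrinsic itself; the helper name is made up):

```rust
/// Lane mask for a masked tail load: lane `i` is enabled iff byte `i`
/// is still within the `remaining` bytes of the buffer.
fn tail_mask<const LANES: usize>(remaining: usize) -> [bool; LANES] {
    core::array::from_fn(|i| i < remaining)
}
```

For example, a 31-byte tail searched with a 32-lane vector disables only the last lane.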
