Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC for a match based surface syntax to get pointer-to-field #2666

Closed
wants to merge 10 commits into from
331 changes: 331 additions & 0 deletions text/0000-pointer-match.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,331 @@
- Feature Name: `pointer-match`
- Start Date: 2019-03-21
- RFC PR: [rust-lang/rfcs#0000](https://github.com/rust-lang/rfcs/pull/0000)
- Rust Issue: [rust-lang/rust#0000](https://github.com/rust-lang/rust/issues/0000)

# Summary
[summary]: #summary

Extend match syntax and patterns by support for a limited set of operations for
pointers, which involve only address calculation and not actually reading
through the pointer value. Make it possible to use these matches to calculate
addresses of fields even for `repr(packed)` structs and possibly unaligned
pointers where an intermediate reference must not be created.

# Motivation
[motivation]: #motivation

To create a pointer to a field of a struct, there is currently no way in Rust
that avoids creating a temporary reference. Since reference semantics are
stricter, this may lead to silent undefined behaviour where that reference
should not be valid. Depending on the resolution of reference semantics this
affects:

* Creating a pointer to a field of a packed struct, where the reference may be
unaligned (depending on .
* Pointing to fields of an uninitialized type, where the reference points to
uninitialized data. This may be complicated by unions, where it could be
possible that not a single variant is currently completely initialized, yet
one wants to access some subfield. See
<https://github.com/rust-lang/unsafe-code-guidelines/issues/73#issuecomment-460634637>.
* Doing pointer offset calculations where the references does not refer to the
same, or any, allocation. This is because reference calculations are
performed with `getelementptr inbounds`.

# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation

Match expression are extended from support for a reference binding mode, to a
pointer binding mode. Furthermore, a new pattern binds to a pointer, and
identifiers are extended to allow a new mode similar to `ref` and `ref mut`
binding to a reference. These patterns are called pointer pattern and raw
identifier for the remainder of the document.

```
#[repr(packed)]
struct Foo {
a: u16,
b: u32,
}

fn ptr_b(foo: &mut Foo) -> *mut u32 {
let Foo { raw mut b, .. } = foo;
b
}
```

Note that pointer binding mode and pointer pattern requires `unsafe`, even when
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There have been multiple references to these unsafe pointer patterns, but no examples or further explanations. What are they and why are they necessary? Aren't raw mut or raw const patterns sufficient?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pointer patterns are the counter part of reference patterns, necessary for disambiguation in some special cases. They use * <subpattern>, paralleling & <subpattern> and pointer binding mode automatically adds the top-level pointer pattern, the same as the reference pattern implied by reference binding mode. I'll add an example showing the necessity but it boils down to being able to match a pointer by-value and its content with raw pattern, the former being necessary for backwards compatibility.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it will never dereference the pointer. But the arithmetic on the pointer may
implicitely overflow. Furthermore, not all patterns are (yet) allowed, to avoid
implicitely performing an unintended, unsafe read through the pointer. Pointer
binding mode will at first only permit ultimately binding with `raw` and `ref`
and not actually reading the contained memory.

The raw identifier pattern does not require `unsafe` on its own (as seen above,
where we safely match a `&mut` but bind to `*mut`).

This is not only useful for packed fields, but also to access the fields of
*any* object that is only available via pointer because its state invariants
may not yet be fulfilled. Instead of manually doing pointer math:

```
/// Repr Rust, so no layout guarantees, no pointer operations to get to `a`.
struct Weird {
/// Always valid, no matter the memory content.
something_i_dont_care_about: u8,

/// Must only be one of `true` and `false` for a &Weird.
a: bool,
}

/// Unsafety invariants: `w` must
/// * point to some allocation of at least `std::mem::size_of::<Weird>`.
/// * point to memory valid for the chosen lifetime `'a`
/// * be properly aligned.
unsafe fn get_if_init<'a>(w: *const Weird) -> Option<&'a Weird> {
let Weird { raw const a, ..} = w;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this auto-deref, or how does this typecheck?

Your example below with match (0 as *mut usize) does not look like auto-deref happens, but then shouldn't this be let *const Weird { raw const a, .. } = w?

Copy link
Author

@HeroicKatora HeroicKatora Mar 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

match doesn't auto-deref in the ops::Deref sense of expressions. I'd rather view it as only adding reference patterns automatically where required, the implementation kind of agrees. I only suggest automatically wrapping into pointer patterns if required as well (and maybe only in pointer binding mode). Which pattern to add should never be ambiguous–in a top-down pass of the pattern after the rhs of the match is typed we know the required pattern kind–but probably deserves its own section anyways.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your example below with match (0 as *mut usize) does not look like auto-deref happens

Automatically adding reference for pointers has the slightly unfortunate side-effect of colliding with the ability to match pointers by const value (i.e. core::ptr::null()), which does not exist for references afaik. The example was intended to show that disambiguation should be backwards compatible but didn't show automatically adding the pattern.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

match doesn't auto-deref in the ops::Deref sense of expressions. I'd rather view it as only adding reference patterns automatically where required

I do not see how that makes any fundamental difference. The fact remains taht you can write a pointer deref and incur a memory access without ever typing *, such as in

fn foo(x: &bool) -> bool {
    match x {
        true => true,
        false => false,
    }
}

For raw pointers, we want to avoid this because accessing memory through them is unsafe and should only be done explicitly. In this sense, auto-deref on raw pointers and auto-&-pattern in match have the same concern.

This is not about ambiguity, this is about calling out to whoeever reads this code that a raw pointer is being dereferenced. I don't think this should compile:

fn foo(x: *const bool) -> bool {
    unsafe { match x {
        true => true,
        false => false,
    } }
}

Copy link
Author

@HeroicKatora HeroicKatora Apr 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That code would not work. While the pointer pattern is added automatically, it still does not allow any of the value-reading patterns to occur within it. So the value pattern true within in that example would desugar to *const true and then be denied on the grounds that a value pattern occurred inside a pointer pattern. I believe that the wording correctly captures this part of the mechanic but I could have misphrased it. Hence the distinction of auto-deref vs. adding a pattern, because the latter does not circumvent the pattern structure based checks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would indeed fix this concern. It wasn't clear to me from the RFC; there should be examples for this kind of rule.

match core::ptr::read(a as *const u8) {
0 | 1 => Some(std::mem::transmute(w)),
_ => None
}
}
```


# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation

The newly introduced patterns are:

* `raw (const|mut) identifier`; allowed for field bindings and identifier bindings.
These are allowed in the grammar where `ref? mut? identifier` is allowed
currently. For this purpose `raw` is a contextual keyword.
* `* (const|mut) <subpattern>`; to match a pointer not by value but to
additionally use structural patterns to get pointers to the fields of its
underlying type. Their use requires an `unsafe`-block around the expression
in which they appear, be it match or irrefutable bindings. However,
`<subpattern>` does not allow arbitrary content, this is subject to
discussion and future options.

In pointer binding mode, the top-level pattern is wrapped in `* (const|mut)` if
it is a non-reference and non-pointer pattern. This should be analogue to
[reference binding
mode](https://doc.rust-lang.org/reference/patterns.html#binding-modes) where
the wrapping and existence of the pointer patterns serves as disambiguation in
fringe cases.

```
match (0 as *mut usize) {
// What's currently possible, this is a reference pattern and does no pointer-wrapping.
x => (),
// This reference pattern gets a pointer to the pointer.
raw const y => (),
// This explicit pointer-pattern gets a const pointer to the pointed-to place.
*mut raw const z => (),
}
```

The calculation of the value from a pointer pattern will not use an `inbounds`
qualifier when passed to llvm.

There is no restriction on raw-patterns appearing within matching of enum
variants and slices, such that this is possible:

```
#[repr(packed)]
Foo {
field: Enum,
}

enum Enum {
A(usize),
}

fn overwrite_packed_field(foo: &mut Foo ) {
// Actually safe!
let Foo { field: Enum::A(raw mut x), } = foo;

// Write itself not safe, as we write to a pointer :/
unsafe { ptr::write_unaligned(x, 0) };
}
```

Allowed [patterns](https://doc.rust-lang.org/reference/patterns.html) within
pointer patterns (and thus in the sugar of pointer binding mode) are: wildcard
patttern, path patterns that don't refer to enum variants or constants, struct
patterns, tuple patterns, fixed size array patterns, where the last three are
only not allowed to bind their fields with the new pointer pattern and with
`..`, potentially also with `ref mut? identifier`, but not `mut? identifier`.
Some further notes on (dis-)allowed patterns:

* The restrictions don't apply to matching the pointer value itself, as that
is not inside a pointer pattern.
* enum variants and constants obviously read their memory.
* literal, identifier, and reference patterns also constitute a read of the
pointed-to place, and implicitely assert their type's invariants. Better to
keep those operations separate.
* no pointer patterns within pointer patterns, must also actually read memory.
* `ref mut? identifier` may be useful, but may be too tempting sometimes. It
essentially performs a cast of pointer-to-reference and thus comes with the
same caveats: The programmer must ensure liveness and alignment. However,
cast with `transmute` or `as _` is much more explicit.

Since pointer patterns are guaranteed to not rely on the pointed-to memory
invariants, it can also be used to match union fields in interesting ways.
Maybe this is interesting for custom enum-like-encapsulations?

```
union Mix {
f1: (bool, u8),
f2: (u8, bool),
}

let mut m = Mix { f2: (3, true), };
// f1.0 is not validly initialized, don't grab reference.
let Mix { f1: (raw const f1_0_ptr, _), } = &m;
// Initialize f1.0 through valid f2.0
m.f2.0 = 0;
// Now we can grab the reference.
let f1_0 = unsafe { &*f1_0_ptr };
```

Match unsized values should also simply work, I don't see any complication over
matching those by reference as the pointer already includes the necessary
(length)-metadata. With regards to network protocols, this would become much,
much cooler with unsized unions but you can't have your cake and eat it, yet.

```
#[repr(C)]
struct net_pckt {
protocol_type: u8,
content: [u8],
};

unsafe {
// Works nicely even with changes to the packet structure.
let net_pckt { raw mut content, ..} = uninitialized_packet_ptr;
// Get a pointer two bytes into the content. The pointer has the necessary length-metadata.
match content {
[_, _, raw ptr] => /* Packet large enough */ (),
_ => return Err(Error::Truncated),
}
}
```

# Drawbacks
[drawbacks]: #drawbacks

Match syntax is 'more heavy' than a place based syntax in some or many cases.
On the other side of the coin, initializing a struct often involves grabbing
pointers to all fields, where matching is much terser than each indivdual
expression.

The additional pointer binding mode for match expressions may be confusing due
to the non-explicit pointer nature of its argument.

The pointer retrieved from `raw mut` binding while matching a `&mut _` value
upholds more guarantees than aparent, as it is known to be writable with
`ptr::write_unaligned`. Some yet-to-be-proposed encapsulation could thus make
this completely safe to the programmer. This is a drawback because of the next
argument.

Assigning semantics to the pattern matching of `*` and `raw` has the risk of
being too restricted for future operations but too constrained to allow
backwards compatible extension. Specifically, the type of `id` in a `raw id`
pattern may be hard to change but a pointer upholds almost no invariants on its
own.

# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

`&raw <place>` was also proposed to achieve getting a pointer to a field. The
pattern/match syntax has several advantages over place syntax:

* Place expressions are overloaded with auto-deref, custom indexing
(`core::ops::Index`/`core::ops::IndexMut`), invoking arbitrary user code. A
solution with place syntax needs to explicitely forbid these forms of place
statements, both to disallow user code and avoid accidental reference
intermediates. The new statements thus resembles a very different other
statement.
* The initial dereferencing of the pointer necessary for a place expression
(`struct.field` is implicitely `(*struct).field` for a reference argument
`struct`) will not work with pointer arguments, which do no automatically
dereference even in unsafe code (and arguably should not, outside `&raw`).
* `raw` feels more natural when paralleling `ref` instead of appearing as yet
an *additional* qualifier on `&` that is not associated with pointers
in the first place and confusingly also requires `const` in spite of `&`
suggesting the opposite.
* It provides a clear pattern that extends to enum fields in packed structs,
which are not absolutely not expressible in place syntax.

In contrast, patterns fully follow the structural nature of algebraic data
types without customization points in the form of `core::ops`. This makes them
a perfect match when the possibilities should be restricted to exactly those
options.

Not doing this would keep surface level code for creating pointers error prone
or impossible, independent of the underlying MIR changes.

# Prior art
[prior-art]: #prior-art

C++ state-of-the-art, to my best knowledge, also uses the usual lvalue
expression for a pointer to a field. This has several pitfalls: Classes may
overwrite the pointer dereference operator `->`, and the pointer creation
operator `&`. Actually conformant generic code thus requires additional
artificial constructs and a syntax that does not resemble lvalue syntax.
Additionally, most of the operator are not defined while their target object is
not life, making them unfit for initialization of uninitialized objects.

C (and C++ to an extent) also have `offsetof`, a macro based solution to get
the byte offset of a field. This only works reliably for [a very restricted set
of types](https://en.cppreference.com/w/cpp/named_req/StandardLayoutType). This
essentially is the analogue of `#[repr(C)]` in Rust. A `static_assert` based
solution can help unwittingly triggering undefined behaviour on other types.

No other algebraic language with the memory model of Rust is known to the
author, thus comparisons in this way are sparse.

The PR [#2582](https://github.com/rust-lang/rfcs/pull/2582) contains the
necessary MIR operations to perform the address calculations themselves.

# Unresolved questions
[unresolved-questions]: #unresolved-questions

The exact syntax for pointer patterns, while `raw` as a contextual keyword has
already some association with pointer to place it need not be the final answer.
An alternative is, of course, a contextual keyword `ptr` for that pattern.
However, `ptr` will be more ambiguous should a similar syntax be adopted
outside of patterns.

The restrictions on pointer binding mode that are only based on not implicitely
reading memory (enum variants, constants, references, bindings) do not add real
safety, as the matching must occur within an `unsafe` block in any case.
However, they likely do protect against accidental usage similar to auto-deref
in a place expression. They may arguable be more a nuisance than a safety help
nonetheless.

Address calculation will likely depend on not overflow the pointer, i.e. behave
like `pointer::add` but could also utilize `pointer::wrapping_add` instead.
That would make the code safer but provide fewer optimization opportunities.
Also, wrapping addition could promote use to get (specific) field offsets,
within the limits of layout guarantees offered by rust. Since it occurs in an
unsafe block, the burden of fulfilling necessary preconditions ultimately
relies on the programmer.

`ref mut? identifier` within pointer patterns may be disallowed or not. `raw
identifier` pattern. For half-initialized structs where validity and alignment
of the underlying struct has been checked but `&mut` referencing the complete
struct is not safe due to uninitialized fields this is also useful.
Alternatively, this could be disallowed if not useful enough or it seems to
promote undefined behaviour.

# Future possibilities
[future-possibilities]: #future-possibilities

Some pointer binding matches may be safer than the required `unsafe` suggests:
For example the pointer retrieved from `MaybeUninit` guarantees that the memory
is actually backed by some allocation and thus the offset calculations can both
utilize `inbounds` and will never overflow. It could be possible to remove the
need for an `unsafe` block around such matches if they don't use any of the
memory-reading-patterns discussed in unresolved questions.