New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Symbol Mangling v2 #2603

Open
wants to merge 11 commits into
base: master
from

Conversation

Projects
None yet
@michaelwoerister

michaelwoerister commented Nov 27, 2018

Rendered
Reference Implementation
Pre-RFC

Summary

This RFC proposes a new mangling scheme that describes what the symbol names generated by the Rust compiler. This new scheme has a number of advantages over the existing one which has grown over time without a clear direction. The new scheme is consistent, does not depend on compiler internals, and the information it stores in symbol names can be decoded again which provides an improved experience for users of external tools that work with Rust symbol names. The new scheme is based on the name mangling scheme from the [Itanium C++ ABI][itanium-mangling].

Motivation

Due to its ad-hoc nature, the compiler's current name mangling scheme has a
number of drawbacks:

  • It depends on compiler internals and its results cannot be replicated by another compiler implementation or external tool.
  • Information about generic parameters and other things is lost in the mangling process. One cannot extract the type arguments of a monomorphized function from its symbol name.
  • The current scheme is inconsistent: most paths use Itanium style encoding, but some don't.
  • The symbol names it generates can contain . characters which is not generally supported on all platforms. [1][2][3]

The proposed scheme solves these problems:

  • It is defined in terms of the language, not in terms of compiler data-structures that can change at any given point in time.
  • It encodes information about generic parameters in a reversible way.
  • It has a consistent definition that does not rely on pretty-printing certain language constructs.
  • It generates symbols that only consist of the characters A-Z, a-z, 0-9, and _.

This should make it easier for third party tools to work with Rust binaries.

@michaelwoerister michaelwoerister changed the title from Symbol Mangling v2 to RFC: Symbol Mangling v2 Nov 27, 2018

@eddyb

This comment has been minimized.

Member

eddyb commented Nov 27, 2018

For the record, I'll be starting the compiler implementation/integration work ASAP, to get this RFC in rustc nightly, and later on, in other tools (such as GDB, LLDB, etc.).

Doing this at the same time as the RFC will give us the ability to collect data at scale, and figure out edge cases and performance tradeoffs we might miss otherwise.

### Methods
Methods are nested within `impl` or `trait` items. As such it would be possible to construct their symbol names as paths like `my_crate::foo::{{impl}}::some_method` where `{{impl}}` somehow identifies the the `impl` in question. Since `impl`s don't have names, we'd have to use an indexing scheme like the one used for closures (and indeed, this is what the compiler does internally). Adding in generic arguments to, this would lead to symbol names looking like `my_crate::foo::impl'17::<u32, char>::some_method`.

This comment has been minimized.

@eddyb

eddyb Nov 27, 2018

Member

In the interest of keeping this RFC sufficiently detached from current implementation details, can we use some more general placeholder notation, such as <impl>, instead of {{impl}}?

This comment has been minimized.

@michaelwoerister

michaelwoerister Nov 28, 2018

It's just an example of how not to do it. The {{xyz}} notation is meant to remind of what some templating engines use, not what the compiler did at some point. But I can change it to <impl> if you prefer that.

- Identifiers and trait impl path roots can have a numeric disambiguator (the `<disambiguator>` production). The syntactic version of the numeric disambiguator maps to a numeric index. If the disambiguator is not present, this index is 0. If it is of the form `s_` then the index is 1. If it is of the form `s<base-62-digit>_` then the index is `<base-62-digit> + 2`. The suggested demangling of a disambiguator is `[<index>]`. However, for better readability, these disambiguators should usually be omitted in the demangling altogether. Disambiguators with index zero can always be omitted.
The exception here are closures. Since these do not have a name, the disambiguator is the only thing identifying them. The suggested demangling for closures is thus `{closure}[<index>]`.

This comment has been minimized.

@eddyb

eddyb Nov 27, 2018

Member

Similarly here, we should avoid braces. What does C++ do for its lambdas?

This comment has been minimized.

@michaelwoerister

michaelwoerister Nov 28, 2018

GCC uses something with braces and indices too:

int square(int num) {
    auto foo = [num]() -> int { return num * num; };
    return foo();
}

The closure is demangled as square(int)::{lambda()#1}::operator()() const
(see https://godbolt.org/z/TaXWCe)

This comment has been minimized.

@eddyb

eddyb Nov 28, 2018

Member

Do debuggers work well with it? If so, how? Can we do some tests to see what works and what doesn't?

This comment has been minimized.

@michaelwoerister

michaelwoerister Nov 28, 2018

I assume that debuggers treat lambdas as regular operator() methods. What kind of tests did you have in mind?

This comment has been minimized.

@eddyb

eddyb Nov 28, 2018

Member

I'm referring to the problems @m4b mentions in #2603 (comment), regarding debuggers not being able to let you refer to symbol names that contain { (or perhaps only {{?).

If we change {{closure}} in the compiler with some other notation, we can see how well GDB and LLDB interact with the symbol names.

Although it's possible debuggers only handle such symbol names when they come from a mangling, which would mean debuggers should just pick a demangling that works for them, right?

The exception here are closures. Since these do not have a name, the disambiguator is the only thing identifying them. The suggested demangling for closures is thus `{closure}[<index>]`.
- In a lossless demangling, identifiers from the value namespace should be marked with a `'` suffix in order to avoid conflicts with identifiers from the type namespace. In a user-facing demangling, where such conflicts are acceptable, the suffix can be omitted.

This comment has been minimized.

@eddyb

eddyb Nov 27, 2018

Member

Wouldn't that include all the statics and functions? Seems a bit excessive.

This comment has been minimized.

@michaelwoerister

michaelwoerister Nov 28, 2018

It does, but I don't think there's a way around it. Otherwise you get conflicts for examples like:

fn foo() {
    fn bar() {}
}

mod foo {
    fn bar() {}
}

Note though that this is only for "lossless" demanglings. For most user-facing demanglings, like in debuggers or backtraces, the suffix can just be omitted. I suggest that demanglers support lossless or verbose option that is usually set to false.

struct Foo<T>(T);
impl<T> Clone for Foo<T> {
fn clone<U>(_: U) {

This comment has been minimized.

@eddyb

eddyb Nov 27, 2018

Member

Clone::clone can't take type parameters.

This comment has been minimized.

@michaelwoerister

michaelwoerister Nov 28, 2018

Yeah, I'll come up with a better example.

This comment has been minimized.

@eternaleye

eternaleye Nov 28, 2018

Borrow<T> could work well

EDIT: Wait, no, missed you wanted the type param on the method.

}
```
- unmangled: `mycrate::Foo::bar::QUUX`
- mangled: `_RNMN11mycrate_xxx3FooE3barV4QUUXVE`

This comment has been minimized.

@eddyb

eddyb Nov 27, 2018

Member

Shouldn't these mention type parameters?

This comment has been minimized.

@arielb1

arielb1 Nov 27, 2018

Contributor

Sure enough. You want to be able to distinguish between these 2 cases (this code compiles today):

struct Foo<U,V>(U,V);

impl<U: Fn()> Foo<U, u32> {
    fn foo() {}
}

impl<U: Fn()> Foo<u32, U> {
    fn foo() {}
}

This comment has been minimized.

@michaelwoerister

michaelwoerister Nov 28, 2018

OK, we'll have to take care of that then.

This comment has been minimized.

@michaelwoerister

michaelwoerister Dec 7, 2018

I had a little chat with @nikomatsakis about this yesterday and the outcome was that:

  • we always should encode type parameters in paths like this one and
  • we should also always encode parameter bounds in some form because there is no way to find out if they are needed for disambiguation without looking at other impls -- which we want to avoid. The bounds could be encoded in a numeric disambiguator though.

The consequences this has on symbol syntax should be small. We just have to find the best spot for adding parameter bounds.

This comment has been minimized.

@eddyb

eddyb Dec 7, 2018

Member

we should also always encode parameter bounds

I still think that's not ideal, and I'd prefer having a disambiguated path to the impl and/or to the type parameters (either of which would be hidden in the non-verbose mode).

struct Foo<T>(T);
impl<T> Clone for Foo<T> {
default fn clone<U>(_: U) {

This comment has been minimized.

@eddyb

eddyb Nov 27, 2018

Member

Similarly here, with the extraneous <U>.

<path-root> := <crate-id>
| M <type>
| X <type> <abs-path>

This comment has been minimized.

@rpjohnst

rpjohnst Nov 27, 2018

Bringing this comment up again: https://internals.rust-lang.org/t/pre-rfc-a-new-symbol-mangling-scheme/8501/4?u=rpjohnst. Would it make sense to move the trait's self type into its argument list? It reorders things from how they are displayed in error messages, but simplifies the grammar a bit.

This comment has been minimized.

@eddyb

eddyb Nov 27, 2018

Member

I agree, we already treat <X as Trait<Y, Z>> as sugar for Trait applied with [X, Y, Z], there's no real reason to have it separate here.

This comment has been minimized.

@michaelwoerister

michaelwoerister Nov 28, 2018

I didn't forget about the suggestion but unfortunately, while implementing it, it turned out that it makes the demangler a lot more complicated -- at least if we want to stick to the <X as Trait> demangling. If we mangle trait methods as foo::bar::Trait<SelfType, X, Y, Z>::method, the demangler cannot know that it is dealing with a trait method when it starts demangling the path at foo. It could only discover that when it gets to Trait and would then have to rewind and store the already generated output (foo::bar::Trait) on the heap, demangle the self-type, then copy back the trait path and continue demangling the trait's type arguments. It can only know that Trait is a trait if we put a special marker on the identifier, so traits would again be special cased. As a consequence, I thought, if we have to special case traits one way are the other, we can as well do it in a way that allows for efficient demangling and doesn't need the extra kind of logic.

The situation would be different if we actually wanted to demangle trait methods to foo::bar::Trait<SelfType, X, Y, Z>::method. But I don't think we want to do that, right?

This comment has been minimized.

@eddyb

eddyb Nov 28, 2018

Member

Oh, wait, an on-the-fly demangler needs to have everything in demangled order, right.
Is this only needed for <X as Trait<Y, Z>>, or are there other "out of order" constructs?

This comment has been minimized.

@michaelwoerister

michaelwoerister Nov 28, 2018

As far as I can tell <X as Trait<Y, Z>> is the only case.

```
### Items Within Specialized Trait Impls

This comment has been minimized.

@arielb1

arielb1 Nov 27, 2018

Contributor

Theoretically, you could also have stuff like this:

struct Foo<T>(T);

impl<T> Foo<T> where T: FnOnce() -> u32 {
    fn foo() {
        static ABC: u32 = 0;
    }
}

impl<T> Foo<T> where T: FnOnce() -> f32 {
    fn foo() {
        static ABC: u32 = 0;
    }
}

It is not supported by today's coherence, but it might be supported someday in the future.

I suppose that for now it is enough to also let this case use the <Foo<T>>'N format for either or both impls.

This comment has been minimized.

@michaelwoerister

michaelwoerister Nov 28, 2018

Yes, the RFC proposes to use a numeric disambiguator for keeping the two impls apart -- until specialization is finalized, at which point the disambiguator would be replaced with something more human-readable, which probably amounts to an encoding of the where clauses.

This comment has been minimized.

@arielb1

arielb1 Nov 28, 2018

Contributor

That code does not depend on specialization, as there is no overlap.

@michaelwoerister

This comment has been minimized.

michaelwoerister commented Nov 28, 2018

@m4b Let's discuss closures a bit. I want to get them fixed. The RFC proposes to demangle them as some::function::{closure}[3] where [3] means that it is there fourth closure within some::function. Am I correct in assuming that you don't find this readable enough and would prefer something like some::function::{closure at line 77}?

@michaelwoerister

This comment has been minimized.

michaelwoerister commented Nov 28, 2018

@bstrie Yes, I'm aware of these problems. I personally prefer to the indexing approach (I find C++'s {lambda()#1} notation quite appealing, actually). I'm still interested in hearing alternative suggestions.

@zackw

This comment has been minimized.

Contributor

zackw commented Nov 28, 2018

I’m an end-user interested in cross-language interop, and I have some experience with implementation of the Itanium C++ ABI. I would like to provide a few notes on this RFC.

  1. As I said in the pre-RFC discussion, I think the Rust mangling should be intentionally incompatible with all known C++ ABIs, because the linker should not resolve a call to extern void foo(int, int) from C++ as targeting a Rust function with the signature pub fn foo(i32, i32) -> () unless that function was specifically declared as extern "C++".

    The current proposal achieves this by using _R as the prefix for mangled names, and that’s plenty good enough, but I think it should be written down as a concrete reason not to go for C++ ABI compatibility.

  2. I think functions’ mangled names should always encode the full type signature of the function, even though Rust does not have function overloading. This would be a safety feature. It would trap cross-crate type mismatches between caller and callee at link time. (I have the impression this is supposed to be one of the purposes of crate disambiguators, but I do not trust them to do the job completely. Also, putting the type signatures into the mangled names would enable the linker to give better error messages.)

  3. The Unicode handling is underspecified. RFC 2457 and its stabilization issue indicate that issues like normalization, the subset of characters allowed in identifiers, etc. are being handled at the language level, but a demangler needs to know what to do with arbitrary nonsense produced by tools other than a correctly-implemented Rust compiler: e.g. a buggy Rust compiler, the extern "Rust" support in some other compiler (also arguably buggy, but still), and people writing assembly language by hand.

    I think this RFC should say that demanglers should check whether the result of decoding the punycode is a valid Rust identifier according to the rules that end up getting stabilized in the RFC 2457 process, and display the punycode string if it doesn’t qualify. For instance, _RN15mycrate_4a3b56d9godel_fgdu6escher4bachVE, which uses the NFD encoding of gödel, should be decoded as mycrate[4a3b56d]::[godel_fgd]::escher::bach.

    Also, please add a cross-reference to RFC 2457.

  4. In my opinion, the possibility of Unicode sequences that are not valid Rust identifiers being shoved into a mangled name by a tool that doesn’t follow all of the appropriate rules is a strong reason not to allow the use of raw UTF-8 in mangled names.

  5. ABI markers are simultaneously under- and overspecified. There’s what appears to be an exhaustive list of codes for calling conventions that are currently supported on any architecture, including several mutually exclusive groups, e.g. you’ll never need "m" and "i" on the same computer, I hope. And at the same time it doesn’t say what it means if the ABI marker is not present at all. I would suggest cutting this down drastically, e.g.

    // If the <abi> is not present, the function uses the usual Rust calling
    // convention for this architecture and OS.
    <abi> = "K" (
        "c" |     // Usual C calling convention for this arch and OS
        "j" |     // Rust intrinsic calling convention
        "i" |     // Interrupt handler for this architecture
        // Other single-lowercase-letter codes may be defined by each
        // architecture and OS; for instance, "s" could mean the Win32
        // "stdcall" convention.
    )
    

    Also, please add cross-references to the exact definitions of each of the calling conventions that are referenced by name.

  6. It appears to me that the N prefix and E suffix on <abs-path> are unnecessary, and the E suffix may also be unnecessary in several of the other places where it’s used. The Itanium C++ ABI only uses N ... E notation when it’s necessary to disambiguate “nested names,” e.g. when a name needs to be encoded as part of a template parameter.

Thanks for listening.

@eddyb

This comment has been minimized.

Member

eddyb commented Nov 28, 2018

It would trap cross-crate type mismatches between caller and callee at link time.

This can't handle all possible ABI-impacting details, only shallow ones, and on top of that, rustc already has better detection of incompatible crates, solving most, if not all, linking concerns.
C++ has this issue because of header files, but Rust doesn't have any equivalent features.

So to me, it seems like this would just increase symbol name size, without many (any?) benefits.

- A mangled symbol should be *decodable* to some degree. That is, it is desirable to be able to tell which exact concrete instance of e.g. a polymorphic function a given symbol identifies. This is true for external tools, backtraces, or just people only having the binary representation of some piece of code available to them. With the current scheme, this kind of information gets lost in the magical hash-suffix.
- It should be possible to predict the symbol name for a given source-level construct. For example, given the definition `fn foo<T>() { ... }`, the scheme should allow to construct, by hand, the symbol names for e.g. `foo<u32>` or `foo<extern fn(i32, &mut SomeStruct<(char, &str)>, ...) -> !>()`. Since the current scheme generates its hash from the values of various compiler internal data structures, not even an alternative compiler implementation could predicate the symbol name, even for simple cases.

This comment has been minimized.

@Wilfred

Wilfred Nov 28, 2018

Spelling pedantry: I think this should be predict

@rocallahan

This comment has been minimized.

rocallahan commented Nov 28, 2018

What justifies the additional complexity of the "does not itself add any new information" rule for node equivalence? Is this a microoptimization or does it make things easier to implement?

"j" | // RustInstrinsic
"p" | // PlatformInstrinsic
"u" // Unadjusted
)

This comment has been minimized.

@jsgf

jsgf Nov 29, 2018

This all seems a bit arbitrary. Given that in principle there could be an unbounded number of ABIs, it seems like we should splurge on using a real string here rather than a single character. I'm also going to guess that these will be relatively rare, so space isn't a consideration?

This comment has been minimized.

@eddyb

eddyb Nov 29, 2018

Member

I'd also favor encoding the ABI "string" (ideally as an identifier, replacing - with _, etc.)

This makes me wonder if Rust should've used extern(C) fn syntax instead of extern "C" fn, but it's too late now.

This comment has been minimized.

@michaelwoerister
"u" // Unadjusted
)
<disambiguator> = "s" [<base-62-digit>] "_"

This comment has been minimized.

@jsgf

jsgf Nov 29, 2018

Is this only a single digit? What if more than 62 things need disambiguation? I can imagine such things arising in generated code.
I'd propose {<base-62-digit>}.

This comment has been minimized.

@eddyb

eddyb Nov 29, 2018

Member

Meta-nit: in a post-regex world, I find EBNF somewhat unintuitive: it took me a while to even notice that by {...} you meant "replace ? with *", initially I thought you were talking about "{" ... "}".

cc @Centril (who started using "lyg" syntax instead)

This comment has been minimized.

@michaelwoerister

michaelwoerister Nov 29, 2018

Whoops, that's just a mistake in the grammar. It should be {<base-62-digit>} indeed.

This comment has been minimized.

@michaelwoerister

michaelwoerister Nov 29, 2018

Yeah, I don't really care which notation is used :P

<generic-arguments> = "I" {<type>} "E"
<substitution> = "S" [<base-62-digit>] "_"

This comment has been minimized.

@jsgf

jsgf Nov 29, 2018

Likewise seems safer to make this {...} (notwithstanding other comments about compression).

With this post-processing in place the Punycode strings can be treated like regular identifiers and need no further special handling.
## Compression

This comment has been minimized.

@jsgf

jsgf Nov 29, 2018

I'd be very tempted to omit this kind of ad-hoc compression scheme in favour of using a standard algorithm like zstd (say, zstd with a well-defined domain-specific dictionary). (cc @Cyan4973 - are there any examples of using zstd for per-symbol compression?)

Historically C++ demanglers have been very fragile, and I suspect a big part of that is due to the implementation complexity of the Itanium ABI compression mechanism.

Aside from the implementation issues, because this is so coupled with the definition of the mangling scheme itself, it means that any future evolution of mangling needs to also take compression into account. Using a completely separate compression layer makes this a non-issue. The other nice thing about making compression largely isolated from the rest of the encoding is that it means it can be added in a second pass as an extension once we have some experience with uncompressed mangling - maybe it wouldn't be so bad?

The main problem with using an external library is that any Rust demangler introduces another dependency. This is particularly worth considering when integrating Rust demangler support into other tools like binutils/llvm/perf/valgrind/etc.

This comment has been minimized.

@eddyb

eddyb Nov 29, 2018

Member

I believe we can come up with a simple compression scheme (i.e. refer back to the byte position where the first occurrence of something was encoded).

This would allow a demangler implementation to have 0 external dependencies, and the specification would also not implicitly depend on another standard.

It might also behave better than zstd given the limitation of [a-zA-Z0-9_] (which makes bit streams less appealing), and it has the advantage that any path component name is guaranteed to show up in clear.

However, I would not be opposed to at least having a compiler mangling option which disregards the [a-zA-Z0-9_] limitation and which does the best compression it can, for use in situations where that might be advantageous (although at that point, you might be better off with symbol names just, say, hashes, and keep everything else in split debuginfo, and/or an ad-hoc mapping from hashes to symbol names).

Oh and should definitely gather data on all compression schemes we can think of (that are not too painful to implement), before we accept the RFC!

This comment has been minimized.

@michaelwoerister

michaelwoerister Nov 29, 2018

I've listed the zstd option as point 5 in Rational and Alternatives. I would be interested in seeing how zstd performs. But it does come with some real downsides:

  • Every demangler would have to support zstd. That's another dependency that not everyone might want to pull in.
  • The specification of the mangling scheme would depend on the specification of zstd. I see that there's an IETF RFC for it. That's good. But it's still rather heavyweight.
  • Mangled symbol would not retain any human-readability at all.

I think one of the next steps would be to collect a body of symbol names for testing different compression schemes.

This comment has been minimized.

@michaelwoerister

michaelwoerister Nov 29, 2018

@jsgf I do acknowledge, btw, that an AST-independent compression scheme is clearly beneficial when it comes to evolving the grammar.

This comment has been minimized.

@jsgf

jsgf Dec 2, 2018

@michaelwoerister Yes, I think extra dependencies and non-readability reasonable counter-arguments to using zstd.

(waving hands wildly) I'm assuming that we wouldn't bother compressing small symbols, so they would remain directly readable, and large symbols with any compression scheme would be such a soup that even if the compression scheme leaves some parts "in the clear" they're still not directly readable in any practical sense.

### Punycode vs UTF-8
During the pre-RFC phase, it has been suggested that Unicode identifiers should be encoded as UTF-8 instead of Punycode on platforms that allow it. GCC, Clang, and MSVC seem to do this. The author of the RFC has a hard time making up their mind about this issue. Here are some interesting points that might influence the final decision:
- Using UTF-8 instead of Punycode would make mangled strings containing non-ASCII identifiers a bit more human-readable. For demangled strings, there would be no difference.

This comment has been minimized.

@jsgf

jsgf Nov 29, 2018

This is moot if compression (of any kind) is applied as well.

This comment has been minimized.

@michaelwoerister

michaelwoerister Nov 29, 2018

You mean because compressed names are unreadable in any case?

This comment has been minimized.

@jsgf
### Methods
Methods are nested within `impl` or `trait` items. As such it would be possible to construct their symbol names as paths like `my_crate::foo::{{impl}}::some_method` where `{{impl}}` somehow identifies the the `impl` in question. Since `impl`s don't have names, we'd have to use an indexing scheme like the one used for closures (and indeed, this is what the compiler does internally). Adding in generic arguments to, this would lead to symbol names looking like `my_crate::foo::impl'17::<u32, char>::some_method`.

This comment has been minimized.

@jsgf

jsgf Nov 29, 2018

Given that impls can appear anywhere within the crate, would the path be to the impl itself, or to the type being impled?

Do we need distinguish between different impls, or just impls with different constraints?

Given these questions, I think the proposal below to ignore impls themselves makes sense.

This comment has been minimized.

@eddyb

eddyb Nov 29, 2018

Member

The type being impled doesn't have to be a path, it can be e.g. [u8], so I think the safest thing to do would be to have both a path to the impl and the full type (and optionally trait) the impl is for.

This comment has been minimized.

@michaelwoerister

michaelwoerister Nov 29, 2018

The PR proposes to not include the path of the impl at all. @eddyb, you would rather demangle symbols to something like my_crate::foo::impl<u32, char>::some_method?

This comment has been minimized.

@eddyb

eddyb Dec 2, 2018

Member

I don't understand what Self and the Trait are in that example.
What I'm thinking is mangling the equivalent information of e.g. my_crate::foo::impl'17<my_crate::foo::S as my_crate::Trait>::some_method, demangling back to that only in verbose mode, but only showing <my_crate::foo::S as my_crate::Trait>::some_method in the "user-friendly" mode.

This comment has been minimized.

@michaelwoerister

michaelwoerister Dec 7, 2018

OK, yes that's what I thought you meant.

@jsgf

This comment has been minimized.

jsgf commented Nov 29, 2018

@zackw +1 on most of your points, but I think for 2. to matter it would mean that Rust compilation would have to change a lot. In practice with Rust code, one never sees linker errors for Rust symbols.

@michaelwoerister

This comment has been minimized.

michaelwoerister commented Nov 29, 2018

@rocallahan

What justifies the additional complexity of the "does not itself add any new information" rule for node equivalence? Is this a microoptimization or does it make things easier to implement?

There are two reasons for it:

  • Yes, it is a micro-optimization as it will allocate fewer indices and (for example) things that only occurred as a <path-prefix> before can substitute a whole <absolute-path> or <type>.
  • In a way it seems more intuitive when explaining how indices are allocated. Let's say we would not have this extended kind of equivalence. Then looking at an example like foo::Foo<bar::Bar>, you'd have to explain why bar::Bar corresponds to three substitution indices, one for the <type>, one for the <absolute-path>, and one for the <path-prefix>.

None of these reasons makes the optimization strictly necessary but since it's not hard to implement, it seemed like the better choice to include it.

@michaelwoerister

This comment has been minimized.

michaelwoerister commented Nov 29, 2018

Thanks for your detailed comments, @zackw! I'm going to respond to the different points individually.

  1. [...] The current proposal achieves this by using _R as the prefix for mangled names, and that’s plenty good enough, but I think it should be written down as a concrete reason not to go for C++ ABI compatibility.

Sounds good to me.

@michaelwoerister

This comment has been minimized.

michaelwoerister commented Nov 29, 2018

  1. I think functions’ mangled names should always encode the full type signature of the function

That would be quite a bit of additional data in each symbol. And as @eddyb says, rustc uses crate-metadata to catch incompatibilities. Here's an interesting data point: The symbol-hash, that is part of every symbol name in the current mangling scheme, includes the function parameters since at least 2015, if not longer. Yet, in all that time I personally have never seen a linker error catching a mismatch this way. So I am a bit doubtful whether encoding function parameters in symbol names would have a practical purpose (at least as far as safety is concerned).

While I think it would actually be kind of nice to have the parameter types, it is at odds with trying to keep symbol names short.

@michaelwoerister

This comment has been minimized.

michaelwoerister commented Nov 29, 2018

The Unicode handling is underspecified. RFC 2457 and its stabilization issue indicate that issues like normalization, [...]

I think I disagree here. In my opinion it is not the demangler's job to verify that identifiers conform to the Rust spec. It should just print something human-readable.

@michaelwoerister

This comment has been minimized.

michaelwoerister commented Nov 29, 2018

  1. ABI markers are simultaneously under- and overspecified.

Yes, sorry, they are still kind of a mess :) I just copied the current set from the compiler and assigned random letters to them. This needs fixing. Some thoughts:

  • As others have mentioned, it would be good for these to be extensible. We could just allow for an <identifier> production after the K.
  • I think it would still be good to assign single characters to common calling conventions, like extern "C".
  • I don't think per-architecture/OS encodings are a good idea because:
    • there are plenty of characters that we can pick from, so we are not short on encoding space, and
    • a symbol name does not encode which OS/architecture it was produced for. The demangler might not know either (i.e. if you just chucked a symbol into a web-based demangler, as an extreme example).
@michaelwoerister

This comment has been minimized.

michaelwoerister commented Nov 29, 2018

It appears to me that the N prefix and E suffix on are unnecessary, and the E suffix may also be unnecessary in several of the other places where it’s used. The Itanium C++ ABI only uses N ... E notation when it’s necessary to disambiguate “nested names,” e.g. when a name needs to be encoded as part of a template parameter.

I'll take a closer look at that.

@mark-i-m

This comment has been minimized.

Contributor

mark-i-m commented Nov 29, 2018

This is a great RFC :) Thank you!

Some thoughts:

  • It would be great if the mangling/demangling implementation could be made its own stable crate and published on crates.io. That way other tools can use it directly.
  • Not sure if this is possible, but it would be nice if the mangling/demangling implementation worked on no_std so that stack traces on any platform can be pretty (e.g. in a panic). I think this is mostly a question of having a no_std friendly parse-generator, but I'm not sure.
  • I appreciate that this RFC tries to keep names short. One niche usecase I have run into is that in the Linix kernel symbol names are limited to 128 bytes in length. I don't think it will ever be reasonable to limit the mangling scheme to 128B, but it helps that we do compression.
@mark-i-m

This comment has been minimized.

Contributor

mark-i-m commented Nov 29, 2018

Also, perhaps I missed this, but what about symbol names for things like struct fields and tuple elements? In optimized code, this wouldn't be a problem, but what about symbol tables for debuggers?

@m4b

This comment has been minimized.

m4b commented Nov 29, 2018

@michaelwoerister

Let's discuss closures a bit. I want to get them fixed. The RFC proposes to demangle them as some::function::{closure}[3] where [3] means that it is there fourth closure within some::function. Am I correct in assuming that you don't find this readable enough and would prefer something like some::function::{closure at line 77}?

I’m not in front of a computer otherwise I’d give more detailed example comparing c++ and rust, and I don’t know whether it’s our gdb implementation (I believe it isn’t because I switched gdbs language and it still had issues) but I’ve never been able to print or otherwise refer to anything with {{ or }} in their names.

In gdb one can do:

ptype 'foo::bar::<lambda>::operator() const' and it should be able to print (note the single quotes). So afaics gdb is ok with single quoted <> and (), but I’ve not tested [] in names.

For rust closures it errors out. Similar one cannot set a breakpoint for the same reason, or the various other commands, whatis, print, info sym/addr, etc.

I have had similar issues with lldb (though it does not offer a raw print similar to gdb single quotes), so I dunno; sometimes it’s hard to tell if the debugger doesn’t like the symbol or our debuginfo has issues (the latter I’ve seen in the past)

Closures, as well as I believe trait impls, have been a particular sore spot for debugging and rust for me.

As for their display, I don’t have a great many opinions on that per say. Mostly i would just like it to be some approximation of “reasonable”, but I’d really like to see {{closure}} anywhere in the name disappear as it does not currently allow printing. Whatever that ends up being replaced with, whether __closure__ or something else isn’t too important to me, as long as it’s printable.

As for line numbers and file name that was a random idea to make backtrackes more clear; another commenter suggested symbol names would change based on rustfmt which is a good argument to not do that imho; then again, I can also see motivations for not caring ?

I’d at least like some kind of identifying information (if possible) in the backtrace, and to be clear i think the index idea is great, but otherwise I would just like all {{ in symbol names to disappear since it doesn’t appear to play nice with the debuggers I’ve used.

@michaelwoerister

This comment has been minimized.

michaelwoerister commented Nov 30, 2018

Thanks a lot for the detailed response, @m4b!

I would also like to avoid line numbers in symbol names because it would mean that incremental compilation has to re-compile all closures if you change anything else that comes before in the same file. This is already the case when compiling with debuginfo, but we might be able to fix that in the future.

Other than that, I think we just have to find an encoding that fulfills the following requirements:

  • Cannot clash with regular identifiers. This rules out __closure__ but if push comes to shove, we might be able to relax this requirement to "is very unlikely to clash".
  • Does not trip up the parsers in debuggers. Maybe @tromey can give some insights here? I'll do some testing of how this works with C++. I think, debuggers have a Rust-specific parser which should give us some flexibility here. That being said, the RFC does not really define what demangled names look like. We just need to make sure that the mangled version contains all the information needed for demangling.
@michaelwoerister

This comment has been minimized.

michaelwoerister commented Nov 30, 2018

@mark-i-m

It would be great if the mangling/demangling implementation could be made its own stable crate and published on crates.io. That way other tools can use it directly.

That's the plan. See https://github.com/michaelwoerister/std-mangle-rs for the current reference implementation.

Not sure if this is possible, but it would be nice if the mangling/demangling implementation worked on no_std so that stack traces on any platform can be pretty (e.g. in a panic). I think this is mostly a question of having a no_std friendly parse-generator, but I'm not sure.

The only data-structure needed for demangling is a Vec, so we should be good here.

I appreciate that this RFC tries to keep names short. One niche usecase I have run into is that in the Linix kernel symbol names are limited to 128 bytes in length. I don't think it will ever be reasonable to limit the mangling scheme to 128B, but it helps that we do compression.

Yeah, since the mangled names must not lose information on identifiers and generic arguments of the input, and the input is basically unbounded, there is no way of limiting the length of mangled names in general.

@zackw

This comment has been minimized.

Contributor

zackw commented Dec 1, 2018

@michaelwoerister Replying to some of your notes, out of order:

In my opinion it is not the demangler's job to verify that identifiers conform to the Rust spec. It should just print something human-readable.

IMO it's not a matter of verification, it's a matter of ensuring unambiguous demangling. Suppose you have an object file that defines the ill-formed symbol _RN15mycrate_4a3b56d9godel_fgdu6escher4bachVE, it's important not to demangle that as mycrate[4a3b56d]::gödel::escher::bach, because that is the demangling of the well-formed symbol _RN15mycrate_4a3b56d8gdel_Fqau6escher4bachVE, and someone trying to track down a link error will be baffled as to why they don't match. Demanglers need to cope with ill-formed symbols, because demanglers are debugging aids, and that includes debugging the compiler.

ABI markers ...

  • I like the idea of allowing an arbitrary <identifier> after the K.
  • Shorthand notation for extern "C" is fine and all, but it's important for there to be only one name for each ABI. For instance, is there any difference between "c" and "x" on Linux/x86-64? If not, we should have only one of them.
  • I don't feel super strongly about this but, if you have a random symbol that you're pasting into a web demangler, is the distinction between "MSP430 interrupt handler" and "x86 interrupt handler" really important? You probably already know which CPU you are reverse engineering something from.

The symbol-hash, that is part of every symbol name in the current mangling scheme, includes the function parameters since at least 2015, if not longer. Yet, in all that time I personally have never seen a linker error catching a mismatch this way. So I am a bit doubtful whether encoding function parameters in symbol names would have a practical purpose (at least as far as safety is concerned).

I could live without including the full type signature, but I think both @eddyb and you are underestimating the probability of a type clash going undetected under this proposal, particularly when we start talking about cross-language interop and alternative implementations of the Rust compiler (which may have different sets of bugs).

Keep in mind that if there's a hash collision (which is the case where it would matter), you would not see a linker error; the program would appear to link successfully, and then malfunction at runtime. I don't know how the existing hashes are computed; I presume they are cryptographically strong. Per-symbol hashes are the full hash and they do include the type signature, so collisions are unlikely. But crate disambiguator hashes (whether or not they include the full type signature of each symbol) are truncated to only a few hex digits, which makes collisions possible again, even though the hash is strong. So the proposed scheme is significantly weaker at detecting this type of error than the current one.

@@ -49,7 +49,7 @@ A symbol mangling scheme has a few goals, one of them essential, the rest of the
- A mangled symbol should be *decodable* to some degree. That is, it is desirable to be able to tell which exact concrete instance of e.g. a polymorphic function a given symbol identifies. This is true for external tools, backtraces, or just people only having the binary representation of some piece of code available to them. With the current scheme, this kind of information gets lost in the magical hash-suffix.
- It should be possible to predict the symbol name for a given source-level construct. For example, given the definition `fn foo<T>() { ... }`, the scheme should allow to construct, by hand, the symbol names for e.g. `foo<u32>` or `foo<extern fn(i32, &mut SomeStruct<(char, &str)>, ...) -> !>()`. Since the current scheme generates its hash from the values of various compiler internal data structures, not even an alternative compiler implementation could predicate the symbol name, even for simple cases.
- It should be possible to predict the symbol name for a given source-level construct. For example, given the definition `fn foo<T>() { ... }`, the scheme should allow to construct, by hand, the symbol names for e.g. `foo<u32>` or `foo<extern fn(i32, &mut SomeStruct<(char, &str)>, ...) -> !>()`. Since the current scheme generates its hash from the values of various compiler internal data structures, not even an alternative compiler implementation could predict the symbol name, even for simple cases.

This comment has been minimized.

@eddyb

eddyb Dec 2, 2018

Member

We should avoid (accidentally) providing interop guarantees. A compiler should only be guaranteed to interop between artifacts it generated.

Unless what we're saying is that for a specific compiler version, an alternative compiler should be able to generate the same symbol names, but certain details might remain implementation-specific?

Still, it's likely of little use, given that the necessary compiler-specific metadata blob won't be compatible.

This comment has been minimized.

@eddyb

eddyb Dec 2, 2018

Member

Maybe we should keep the hash but make it optional, and have the demangler hide it?
Or at least enough bits of it (1-2 base62 digits) to make it unpredictable.
(and have it depend on the compiler version, just for good measure)

This comment has been minimized.

@eddyb

eddyb Dec 4, 2018

Member

Coming back to this, 2 new ideas:

  1. on the first build of a crate, generate some entropy, store it in crate and incremental metadata (so it can be reused until that crate is cleared)

  2. we don't really need to add an extra character or two from the hash if we make the crate hashes explicitly and intentionally unpredictable, uninteropable and generally implementation-defined

In general, we could use 1. to be less "absolutely" deterministic than we otherwise would be (e.g. TypeId), and avoid accidental interop between two different builds where Cargo happens to use the exact same rustc flags (-Cmetadata specifically) with the same source.

This comment has been minimized.

@zackw

zackw Dec 4, 2018

Contributor

if we make the crate hashes explicitly and intentionally unpredictable

Would this make the build nondeterministic, in the sense used by reproducible-builds.org ?

This comment has been minimized.

@eddyb

eddyb Dec 4, 2018

Member

@zackw Oh, I had forgotten about builds like that, I was only thinking of incremental.
In that case we can only include entropy specific to the rustc compiler executable (and that build would also have to be determinstic - can probably use the version, including the git hash).

In any case, I think it's enough to make sure there's no predictable way to go from -Cmetadata values to the crate hashes in the mangling, even if it has to be deterministic.

This comment has been minimized.

@michaelwoerister

michaelwoerister Dec 7, 2018

As long as this safeguard is:

  • encoded in a single component somewhere,
  • can be omitted in user-facing demanglings,
  • does not impede reproducible builds, and
  • can be removed from the grammar, should we ever want to define a stable ABI,

then I think we might be able do it?

This comment has been minimized.

@eddyb

eddyb Dec 7, 2018

Member

IMO this bulletpoint in the RFC should be rephrased and/or made an explicit anti-goal.

To summarize the comments above, I think it might be sufficient for the crate hashes to be unpredictable (even if deterministic for a given Rust implementation), if you only know the rustc command-line arguments passed by Cargo.

There's, however, a small risk of some tool extracting enough of these hashes to reuse them. For that, we should explicitly say that these are implementation-defined, change between versions, generally have no guarantees, and should not be used for interop.

If we do want to add some extra information, I suppose we can just not guarantee anything about the crate hashes and extend some of them with additional digits, or maybe generalize the scheme used to add hashes to crates and merge it with disambiguators (e.g. for impl)?

We can say disambiguators are implementation-defined in general, can't we?

EDIT: if we treat disambiguators like unspecified blobs, we could even use them for encoding the type information needed for some sort of sound "linking types" scheme (not exactly "stable ABI", but potentially replacing some of the major usecases).

@eddyb

This comment has been minimized.

Member

eddyb commented Dec 2, 2018

I think this is mostly a question of having a no_std friendly parse-generator, but I'm not sure.

I don't think we need a parser generator, it should be straight-forward to demangle symbols in a "handwritten LL(1)-like recursive-descent" fashion.
If we go with a compression scheme based on offsets and not AST indices, you won't even need allocation.

Keep in mind that if there's a hash collision (which is the case where it would matter), you would not see a linker error; the program would appear to link successfully, and then malfunction at runtime.

Wasn't the premise that there's a signature mismatch, making the hash distinct, not equal? Whereas with the new mangling a signature mismatch wouldn't cause a distinct symbol name?

particularly when we start talking about cross-language interop and alternative implementations of the Rust compiler

I'm not sure how those are relevant. I would very much like to see this RFC as "finding a good implementation detail for symbol names that can aid debugging", not anything like a stable type-driven linking ABI detail (such as is the case for C++).

The correctness of any tool should not, IMO, be tied to symbol names. Rust implementations should still be required to maintain their own internal mapping from a stronger identity to enough information to be able to generate the mangling and memory/call ABI details.
In fact, what I plan to implement will allow a per-crate choice of mangling scheme, with this RFC as the default, but alternatives available, including this RFC with different compression schemes, or more extreme versions such as the entire symbol name being one hash (I suspect projects which strip symbol names and debuginfo anyway will get something out of that last one).

EDIT: I am a bit scared of the prospect of people abusing debugging-friendly symbol mangling to get some sort of interop (see also #2603 (comment)), so one concern I will register once this RFC goes into FCP is to ensure that the wording gives us the ability to break such usecases at will.

@rocallahan

This comment has been minimized.

rocallahan commented Dec 2, 2018

In fact, what I plan to implement will allow a per-crate choice of mangling scheme, with this RFC as the default, but alternatives available

Proliferation of mangling schemes would cause problems for debuggers. Please don't.

You may tell developers that opting into non-standard mangling disables debugging, but people are going to ignore that thanks to the eternal optimism of the programmer. Yet eventually someone will want to debug them, and if those binaries can be demangled at all, debuggers will be pressured to support them.

@eddyb

This comment has been minimized.

Member

eddyb commented Dec 2, 2018

@rocallahan I don't expect we'll support more than one decompression scheme after we do enough benchmarking to settle on one. And a hash-only "mangling" wouldn't need any debugger support at all - it has no information and it's not meant to be debugged (short of having split debuginfo that is kept separate from e.g. the binaries being shipped to user of a game/application etc.).

EDIT: the point isn't that we want a plethora of options and variants, it's that the implementations (i.e. Rust compilers) use symbol names as an "output" (which is ideally in a format debuggers can turn into something human-readable), but not the "principal identity" of a definition - that must still be a separate concept, implementation-defined and subject to perma-unstable ABI (for now, at least).

@michaelwoerister

This comment has been minimized.

michaelwoerister commented Dec 4, 2018

I'm also against allowing more than one mangling scheme. If the compiler infrastructure allows to experiment with more than one compression scheme, that's great. But it should not be exposed to the end-user for the reasons @rocallahan gives.

@tromey

This comment has been minimized.

tromey commented Dec 6, 2018

Does not trip up the parsers in debuggers. Maybe @tromey can give some insights here? I'll do some testing of how this works with C++.

The debuggers can cope with nearly anything. gdb doesn't right now because the {{{ stuff in the DWARF seemed incorrect to me at the time, so I disabled the use of it (I haven't gone back to see if that was a good decision); but quoting can be done for linespecs and for expressions, as needed. For lldb I'm less sure about setting breakpoints, I'd have to look, but certainly for expressions it can operate however we like.

@michaelwoerister

This comment has been minimized.

michaelwoerister commented Dec 7, 2018

Thanks for the info, @tromey!

I think we'll want to stick to the index-based encoding for closures.

| "o" // u128
| "s" // i16
| "t" // u16
| "u" // ()

This comment has been minimized.

@eddyb

eddyb Dec 7, 2018

Member

Why is this special, as opposed to leaving it be TE (empty tuple)?

This comment has been minimized.

@michaelwoerister

michaelwoerister Dec 10, 2018

We could use the empty tuple encoding too, yes.

// The <decimal-number> specifies the encoding version.
<symbol-name> = "_R" [<decimal-number>] <absolute-path> [<instantiating-crate>]
<absolute-path> = "N" <path-prefix> [<generic-arguments>] "E"

This comment has been minimized.

@eddyb

eddyb Dec 7, 2018

Member

I guess these choices (N, E) and some of the basic types, happen to match the Itanium ABI?

I was thinking P might make more sense for paths, and I or J as a "closing bracket" (more so than E) but this is kind of a pointless bikeshed, since we don't expect people to read these themselves.

This comment has been minimized.

@michaelwoerister

michaelwoerister Dec 10, 2018

Yeah, I guess E makes sense as an "end" marker. Without a strong reason, I'd just stick to N and E. They are as good as any.

<path-prefix> = <identifier>
| "M" <type>
| "X" <type> <absolute-path> [<disambiguator>]

This comment has been minimized.

@eddyb

eddyb Dec 7, 2018

Member

These could use comments explaining what they look like.

This comment has been minimized.

@michaelwoerister
Mangled names conform to the following grammar:
```
// The <decimal-number> specifies the encoding version.

This comment has been minimized.

@eddyb

eddyb Dec 7, 2018

Member

Do we need this?

I guess it would be hard to remain backwards-compatible with demanglers when we add something, given how arbitrary all the rules are, but that doesn't mean we need to specify a way to upgrade the version, unless we want to change the meaning of the existing syntax?

Do you envision some special handling when a demangler sees a digit after _R?

This comment has been minimized.

@michaelwoerister

michaelwoerister Dec 10, 2018

A demangler could quickly determine that it cannot handle a given symbol. And it could quickly switch between different demangling algorithms based on the version.

It is a more complicated problem than I thought at first, though. I think we'll always try to change as little as possible when adding something. It still means that symbol using new features can't be demangled by older demanglers.

It's still good to have some version marker in there. If there was a radical change in the encoding, we could still keep the _R prefix this way.

@m4b

This comment has been minimized.

m4b commented Dec 10, 2018

A general worry/comment: it doesn’t appear so, but just to clarify, there aren’t any assumptions or details about this RFC which would inhibit or otherwise cause issues with stabilizing a rust ABI in some glorious future, yes?

Stabilizing and specifying symbol mangling and hence symbol names are one piece of that puzzle, so this is a great step forward, but just want to make sure that particular perspective is at least momentarily considered in case there’s something there waiting to cause trouble in the future :)

Again I don’t think so but just wanted to bring it up explicitly.

@michaelwoerister

This comment has been minimized.

michaelwoerister commented Dec 11, 2018

A general worry/comment: it doesn’t appear so, but just to clarify, there aren’t any assumptions or details about this RFC which would inhibit or otherwise cause issues with stabilizing a rust ABI in some glorious future, yes?

I don't think there's anything in here that blocks defining a stable ABI. It looks like some of the numeric disambiguator values will remain implementation defined for the time being but there's no inherent reason they can't be well-specified in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment