RFC: Symbol Mangling v2 #2603

michaelwoerister commented Nov 27, 2018

Rendered
Reference Implementation
Pre-RFC

Summary

This RFC proposes a new mangling scheme that describes what the symbol names generated by the Rust compiler look like. This new scheme has a number of advantages over the existing one, which has grown over time without a clear direction. The new scheme is consistent, does not depend on compiler internals, and the information it stores in symbol names can be decoded again, which provides an improved experience for users of external tools that work with Rust symbol names. The new scheme is based on the name mangling scheme from the [Itanium C++ ABI][itanium-mangling].

Motivation

Due to its ad-hoc nature, the compiler's current name mangling scheme has a
number of drawbacks:

  • It depends on compiler internals and its results cannot be replicated by another compiler implementation or external tool.
  • Information about generic parameters and other things is lost in the mangling process. One cannot extract the type arguments of a monomorphized function from its symbol name.
  • The current scheme is inconsistent: most paths use Itanium style encoding, but some don't.
  • The symbol names it generates can contain . characters, which are not supported on all platforms. [1][2][3]

The proposed scheme solves these problems:

  • It is defined in terms of the language, not in terms of compiler data-structures that can change at any given point in time.
  • It encodes information about generic parameters in a reversible way.
  • It has a consistent definition that does not rely on pretty-printing certain language constructs.
  • It generates symbols that only consist of the characters A-Z, a-z, 0-9, and _.

This should make it easier for third party tools to work with Rust binaries.
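
The character-set guarantee in the last bullet is easy to check mechanically. A minimal sketch, assuming a hypothetical helper name `uses_portable_charset` (it is not part of the RFC or of any existing tool):

```rust
/// Returns true if `symbol` consists only of the characters the proposed
/// scheme allows: A-Z, a-z, 0-9, and `_`.
/// The function name is hypothetical and only serves this illustration.
fn uses_portable_charset(symbol: &str) -> bool {
    !symbol.is_empty()
        && symbol
            .bytes()
            .all(|b| b.is_ascii_alphanumeric() || b == b'_')
}
```

A symbol containing a `.` character, which the current scheme can produce, fails this check.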

michaelwoerister changed the title from "Symbol Mangling v2" to "RFC: Symbol Mangling v2" Nov 27, 2018

Member

eddyb commented Nov 27, 2018

For the record, I'll be starting the compiler implementation/integration work ASAP, to get this RFC in rustc nightly, and later on, in other tools (such as GDB, LLDB, etc.).

Doing this at the same time as the RFC will give us the ability to collect data at scale, and figure out edge cases and performance tradeoffs we might miss otherwise.


### Methods

Methods are nested within `impl` or `trait` items. As such it would be possible to construct their symbol names as paths like `my_crate::foo::{{impl}}::some_method` where `{{impl}}` somehow identifies the `impl` in question. Since `impl`s don't have names, we'd have to use an indexing scheme like the one used for closures (and indeed, this is what the compiler does internally). Adding in generic arguments too, this would lead to symbol names looking like `my_crate::foo::impl'17::<u32, char>::some_method`.

eddyb Nov 27, 2018

Member

In the interest of keeping this RFC sufficiently detached from current implementation details, can we use some more general placeholder notation, such as <impl>, instead of {{impl}}?

michaelwoerister Nov 28, 2018

Author

It's just an example of how not to do it. The {{xyz}} notation is meant to be reminiscent of what some templating engines use, not of what the compiler did at some point. But I can change it to <impl> if you prefer that.


- Identifiers and trait impl path roots can have a numeric disambiguator (the `<disambiguator>` production). The syntactic version of the numeric disambiguator maps to a numeric index. If the disambiguator is not present, this index is 0. If it is of the form `s_` then the index is 1. If it is of the form `s<base-62-digit>_` then the index is `<base-62-digit> + 2`. The suggested demangling of a disambiguator is `[<index>]`. However, for better readability, these disambiguators should usually be omitted in the demangling altogether. Disambiguators with index zero can always be omitted.

The exception here are closures. Since these do not have a name, the disambiguator is the only thing identifying them. The suggested demangling for closures is thus `{closure}[<index>]`.
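
The index rule above can be sketched in a few lines of Rust. This is only an illustration under stated assumptions: the base-62 digit order (`0-9`, then `a-z`, then `A-Z`) is not given in this excerpt and is assumed here, the parser accepts one or more digits, and the function names are hypothetical.

```rust
/// Value of a single base-62 digit.
/// Assumption: digits are ordered 0-9, a-z, A-Z.
fn base62_digit(b: u8) -> Option<u64> {
    match b {
        b'0'..=b'9' => Some((b - b'0') as u64),
        b'a'..=b'z' => Some((b - b'a') as u64 + 10),
        b'A'..=b'Z' => Some((b - b'A') as u64 + 36),
        _ => None,
    }
}

/// Parses an optional disambiguator at the start of `input` and returns
/// the numeric index together with the unparsed remainder.
///   absent        -> index 0
///   "s_"          -> index 1
///   "s<digits>_"  -> index <digits> + 2
fn parse_disambiguator(input: &str) -> Option<(u64, &str)> {
    let rest = match input.strip_prefix('s') {
        Some(rest) => rest,
        None => return Some((0, input)), // no disambiguator present
    };
    let end = rest.find('_')?;
    let (digits, remaining) = (&rest[..end], &rest[end + 1..]);
    if digits.is_empty() {
        return Some((1, remaining));
    }
    let mut value: u64 = 0;
    for b in digits.bytes() {
        value = value.checked_mul(62)?.checked_add(base62_digit(b)?)?;
    }
    Some((value.checked_add(2)?, remaining))
}

/// Suggested demangling for a closure with the given index.
fn demangle_closure(index: u64) -> String {
    format!("{{closure}}[{}]", index)
}
```

With these definitions, `demangle_closure(3)` yields `{closure}[3]`, matching the notation suggested in the text.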

eddyb Nov 27, 2018

Member

Similarly here, we should avoid braces. What does C++ do for its lambdas?

michaelwoerister Nov 28, 2018

Author

GCC uses something with braces and indices too:

int square(int num) {
    auto foo = [num]() -> int { return num * num; };
    return foo();
}

The closure is demangled as square(int)::{lambda()#1}::operator()() const
(see https://godbolt.org/z/TaXWCe)

eddyb Nov 28, 2018

Member

Do debuggers work well with it? If so, how? Can we do some tests to see what works and what doesn't?

michaelwoerister Nov 28, 2018

Author

I assume that debuggers treat lambdas as regular operator() methods. What kind of tests did you have in mind?

eddyb Nov 28, 2018

Member

I'm referring to the problems @m4b mentions in #2603 (comment), regarding debuggers not being able to let you refer to symbol names that contain { (or perhaps only {{?).

If we change {{closure}} in the compiler with some other notation, we can see how well GDB and LLDB interact with the symbol names.

Although it's possible debuggers only handle such symbol names when they come from a mangling, which would mean debuggers should just pick a demangling that works for them, right?


- In a lossless demangling, identifiers from the value namespace should be marked with a `'` suffix in order to avoid conflicts with identifiers from the type namespace. In a user-facing demangling, where such conflicts are acceptable, the suffix can be omitted.

eddyb Nov 27, 2018

Member

Wouldn't that include all the statics and functions? Seems a bit excessive.

michaelwoerister Nov 28, 2018

Author

It does, but I don't think there's a way around it. Otherwise you get conflicts for examples like:

fn foo() {
    fn bar() {}
}

mod foo {
    fn bar() {}
}

Note though that this is only for "lossless" demanglings. For most user-facing demanglings, like in debuggers or backtraces, the suffix can just be omitted. I suggest that demanglers support a lossless or verbose option that is usually set to false.
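
To make the conflict concrete, here is a small sketch of how a demangler might render the two paths from the example above. The types and the `render` function are hypothetical, and treating the crate root as a type-namespace component is an assumption.

```rust
/// Namespace an identifier lives in (simplified to the two relevant here).
#[derive(Clone, Copy, PartialEq)]
enum Namespace {
    Type,
    Value,
}

/// One path component as a demangler might see it (hypothetical type).
struct Component<'a> {
    name: &'a str,
    namespace: Namespace,
}

/// Renders a path. In lossless mode, value-namespace identifiers get a
/// `'` suffix; in user-facing mode the suffix is omitted.
fn render(path: &[Component<'_>], lossless: bool) -> String {
    path.iter()
        .map(|c| {
            if lossless && c.namespace == Namespace::Value {
                format!("{}'", c.name)
            } else {
                c.name.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join("::")
}
```

Under these assumptions, `bar` nested in `fn foo` renders losslessly as `mycrate::foo'::bar'`, while `bar` in `mod foo` renders as `mycrate::foo::bar'`. With the option turned off, both collapse to `mycrate::foo::bar`.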

struct Foo<T>(T);
impl<T> Clone for Foo<T> {
fn clone<U>(_: U) {

eddyb Nov 27, 2018

Member

Clone::clone can't take type parameters.

michaelwoerister Nov 28, 2018

Author

Yeah, I'll come up with a better example.

eternaleye Nov 28, 2018

Borrow<T> could work well

EDIT: Wait, no, missed you wanted the type param on the method.

}
```
- unmangled: `mycrate::Foo::bar::QUUX`
- mangled: `_RNMN11mycrate_xxx3FooE3barV4QUUXVE`

eddyb Nov 27, 2018

Member

Shouldn't these mention type parameters?

arielb1 Nov 27, 2018

Contributor

Sure enough. You want to be able to distinguish between these 2 cases (this code compiles today):

struct Foo<U,V>(U,V);

impl<U: Fn()> Foo<U, u32> {
    fn foo() {}
}

impl<U: Fn()> Foo<u32, U> {
    fn foo() {}
}

michaelwoerister Nov 28, 2018

Author

OK, we'll have to take care of that then.

michaelwoerister Dec 7, 2018

Author

I had a little chat with @nikomatsakis about this yesterday and the outcome was that:

  • we always should encode type parameters in paths like this one and
  • we should also always encode parameter bounds in some form because there is no way to find out if they are needed for disambiguation without looking at other impls -- which we want to avoid. The bounds could be encoded in a numeric disambiguator though.

The consequences this has on symbol syntax should be small. We just have to find the best spot for adding parameter bounds.

eddyb Dec 7, 2018

Member

> we should also always encode parameter bounds

I still think that's not ideal, and I'd prefer having a disambiguated path to the impl and/or to the type parameters (either of which would be hidden in the non-verbose mode).

michaelwoerister Dec 20, 2018

Author

Can you give your reasons why having the path to the impl is better than encoding the bounds? I assume because it's less complicated?

struct Foo<T>(T);
impl<T> Clone for Foo<T> {
default fn clone<U>(_: U) {

eddyb Nov 27, 2018

Member

Similarly here, with the extraneous <U>.

<path-root> := <crate-id>
| M <type>
| X <type> <abs-path>

rpjohnst Nov 27, 2018

Bringing this comment up again: https://internals.rust-lang.org/t/pre-rfc-a-new-symbol-mangling-scheme/8501/4?u=rpjohnst. Would it make sense to move the trait's self type into its argument list? It reorders things from how they are displayed in error messages, but simplifies the grammar a bit.

eddyb Nov 27, 2018

Member

I agree, we already treat <X as Trait<Y, Z>> as sugar for Trait applied with [X, Y, Z], there's no real reason to have it separate here.

michaelwoerister Nov 28, 2018

Author

I didn't forget about the suggestion but unfortunately, while implementing it, it turned out that it makes the demangler a lot more complicated -- at least if we want to stick to the <X as Trait> demangling. If we mangle trait methods as foo::bar::Trait<SelfType, X, Y, Z>::method, the demangler cannot know that it is dealing with a trait method when it starts demangling the path at foo. It could only discover that when it gets to Trait and would then have to rewind and store the already generated output (foo::bar::Trait) on the heap, demangle the self-type, then copy back the trait path and continue demangling the trait's type arguments. It can only know that Trait is a trait if we put a special marker on the identifier, so traits would again be special cased. As a consequence, I thought, if we have to special case traits one way or the other, we might as well do it in a way that allows for efficient demangling and doesn't need the extra kind of logic.

The situation would be different if we actually wanted to demangle trait methods to foo::bar::Trait<SelfType, X, Y, Z>::method. But I don't think we want to do that, right?

eddyb Nov 28, 2018

Member

Oh, wait, an on-the-fly demangler needs to have everything in demangled order, right.
Is this only needed for <X as Trait<Y, Z>>, or are there other "out of order" constructs?

michaelwoerister Nov 28, 2018

Author

As far as I can tell <X as Trait<Y, Z>> is the only case.

```


### Items Within Specialized Trait Impls

arielb1 Nov 27, 2018

Contributor

Theoretically, you could also have stuff like this:

struct Foo<T>(T);

impl<T> Foo<T> where T: FnOnce() -> u32 {
    fn foo() {
        static ABC: u32 = 0;
    }
}

impl<T> Foo<T> where T: FnOnce() -> f32 {
    fn foo() {
        static ABC: u32 = 0;
    }
}

It is not supported by today's coherence, but it might be supported someday in the future.

I suppose that for now it is enough to also let this case use the <Foo<T>>'N format for either or both impls.

michaelwoerister Nov 28, 2018

Author

Yes, the RFC proposes to use a numeric disambiguator for keeping the two impls apart -- until specialization is finalized, at which point the disambiguator would be replaced with something more human-readable, which probably amounts to an encoding of the where clauses.

arielb1 Nov 28, 2018

Contributor

That code does not depend on specialization, as there is no overlap.

Author

michaelwoerister commented Nov 28, 2018

@m4b Let's discuss closures a bit. I want to get them fixed. The RFC proposes to demangle them as some::function::{closure}[3] where [3] means that it is the fourth closure within some::function. Am I correct in assuming that you don't find this readable enough and would prefer something like some::function::{closure at line 77}?

Author

michaelwoerister commented Nov 28, 2018

@bstrie Yes, I'm aware of these problems. I personally prefer the indexing approach (I find C++'s {lambda()#1} notation quite appealing, actually). I'm still interested in hearing alternative suggestions.

Contributor

zackw commented Nov 28, 2018

I’m an end-user interested in cross-language interop, and I have some experience with implementation of the Itanium C++ ABI. I would like to provide a few notes on this RFC.

  1. As I said in the pre-RFC discussion, I think the Rust mangling should be intentionally incompatible with all known C++ ABIs, because the linker should not resolve a call to extern void foo(int, int) from C++ as targeting a Rust function with the signature pub fn foo(i32, i32) -> () unless that function was specifically declared as extern "C++".

    The current proposal achieves this by using _R as the prefix for mangled names, and that’s plenty good enough, but I think it should be written down as a concrete reason not to go for C++ ABI compatibility.

  2. I think functions’ mangled names should always encode the full type signature of the function, even though Rust does not have function overloading. This would be a safety feature. It would trap cross-crate type mismatches between caller and callee at link time. (I have the impression this is supposed to be one of the purposes of crate disambiguators, but I do not trust them to do the job completely. Also, putting the type signatures into the mangled names would enable the linker to give better error messages.)

  3. The Unicode handling is underspecified. RFC 2457 and its stabilization issue indicate that issues like normalization, the subset of characters allowed in identifiers, etc. are being handled at the language level, but a demangler needs to know what to do with arbitrary nonsense produced by tools other than a correctly-implemented Rust compiler: e.g. a buggy Rust compiler, the extern "Rust" support in some other compiler (also arguably buggy, but still), and people writing assembly language by hand.

    I think this RFC should say that demanglers should check whether the result of decoding the punycode is a valid Rust identifier according to the rules that end up getting stabilized in the RFC 2457 process, and display the punycode string if it doesn’t qualify. For instance, _RN15mycrate_4a3b56d9godel_fgdu6escher4bachVE, which uses the NFD encoding of gödel, should be decoded as mycrate[4a3b56d]::[godel_fgd]::escher::bach.

    Also, please add a cross-reference to RFC 2457.

  4. In my opinion, the possibility of Unicode sequences that are not valid Rust identifiers being shoved into a mangled name by a tool that doesn’t follow all of the appropriate rules is a strong reason not to allow the use of raw UTF-8 in mangled names.

  5. ABI markers are simultaneously under- and overspecified. There’s what appears to be an exhaustive list of codes for calling conventions that are currently supported on any architecture, including several mutually exclusive groups, e.g. you’ll never need "m" and "i" on the same computer, I hope. And at the same time it doesn’t say what it means if the ABI marker is not present at all. I would suggest cutting this down drastically, e.g.

    // If the <abi> is not present, the function uses the usual Rust calling
    // convention for this architecture and OS.
    <abi> = "K" (
        "c" |     // Usual C calling convention for this arch and OS
        "j" |     // Rust intrinsic calling convention
        "i" |     // Interrupt handler for this architecture
        // Other single-lowercase-letter codes may be defined by each
        // architecture and OS; for instance, "s" could mean the Win32
        // "stdcall" convention.
    )
    

    Also, please add cross-references to the exact definitions of each of the calling conventions that are referenced by name.

  6. It appears to me that the N prefix and E suffix on <abs-path> are unnecessary, and the E suffix may also be unnecessary in several of the other places where it’s used. The Itanium C++ ABI only uses N ... E notation when it’s necessary to disambiguate “nested names,” e.g. when a name needs to be encoded as part of a template parameter.

Thanks for listening.
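
Point 1 above relies on the prefix alone to keep the two schemes apart. A minimal sketch of how a tool could dispatch on it, using hypothetical names (`Mangling`, `classify`); note that symbols produced by the current Rust scheme also begin with `_Z`, so that branch only says "Itanium-style", not "definitely C++":

```rust
/// Coarse classification of a symbol by its mangling prefix.
#[derive(Debug, PartialEq)]
enum Mangling {
    /// Starts with `_R`: the scheme proposed in this RFC.
    RustV2,
    /// Starts with `_Z`: Itanium C++ ABI style (also used by the
    /// current Rust scheme).
    ItaniumStyle,
    /// Anything else, e.g. an unmangled C symbol.
    Unknown,
}

fn classify(symbol: &str) -> Mangling {
    if symbol.starts_with("_R") {
        Mangling::RustV2
    } else if symbol.starts_with("_Z") {
        Mangling::ItaniumStyle
    } else {
        Mangling::Unknown
    }
}
```

Because no `_R` symbol can ever equal a `_Z` symbol, a linker cannot accidentally resolve a C++ reference to a Rust definition by name.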

Member

eddyb commented Nov 28, 2018

> It would trap cross-crate type mismatches between caller and callee at link time.

This can't handle all possible ABI-impacting details, only shallow ones, and on top of that, rustc already has better detection of incompatible crates, solving most, if not all, linking concerns.
C++ has this issue because of header files, but Rust doesn't have any equivalent features.

So to me, it seems like this would just increase symbol name size, without many (any?) benefits.


- A mangled symbol should be *decodable* to some degree. That is, it is desirable to be able to tell which exact concrete instance of e.g. a polymorphic function a given symbol identifies. This is true for external tools, backtraces, or just people only having the binary representation of some piece of code available to them. With the current scheme, this kind of information gets lost in the magical hash-suffix.

- It should be possible to predict the symbol name for a given source-level construct. For example, given the definition `fn foo<T>() { ... }`, the scheme should allow to construct, by hand, the symbol names for e.g. `foo<u32>` or `foo<extern fn(i32, &mut SomeStruct<(char, &str)>, ...) -> !>()`. Since the current scheme generates its hash from the values of various compiler internal data structures, not even an alternative compiler implementation could predicate the symbol name, even for simple cases.

Wilfred Nov 28, 2018

Spelling pedantry: I think this should be predict

rocallahan commented Nov 28, 2018

What justifies the additional complexity of the "does not itself add any new information" rule for node equivalence? Is this a microoptimization or does it make things easier to implement?

"j" | // RustInstrinsic
"p" | // PlatformInstrinsic
"u" // Unadjusted
)

jsgf Nov 29, 2018

This all seems a bit arbitrary. Given that in principle there could be an unbounded number of ABIs, it seems like we should splurge on using a real string here rather than a single character. I'm also going to guess that these will be relatively rare, so space isn't a consideration?

eddyb Nov 29, 2018

Member

I'd also favor encoding the ABI "string" (ideally as an identifier, replacing - with _, etc.)

This makes me wonder if Rust should've used extern(C) fn syntax instead of extern "C" fn, but it's too late now.
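
One way to picture the suggestion of encoding the ABI string as an identifier: a rough sketch that replaces `-` with `_` and then emits the result with a decimal length prefix, the way identifiers such as `3Foo` appear in the mangled examples of this RFC. Keeping the `K` marker from the draft grammar and the function name `encode_abi` are assumptions made for illustration; the thread does not settle on a concrete encoding.

```rust
/// Encodes an ABI string such as "C" or "x86-interrupt" as a
/// length-prefixed identifier behind a `K` marker.
/// Hypothetical encoding: only `-` is replaced; other characters that
/// might need escaping are not handled in this sketch.
fn encode_abi(abi: &str) -> String {
    let ident: String = abi
        .chars()
        .map(|c| if c == '-' { '_' } else { c })
        .collect();
    format!("K{}{}", ident.len(), ident)
}
```

For example, `encode_abi("x86-interrupt")` gives `K13x86_interrupt`. An unbounded number of ABIs can be represented this way without reserving single-letter codes.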

"u" // Unadjusted
)
<disambiguator> = "s" [<base-62-digit>] "_"

jsgf Nov 29, 2018

Is this only a single digit? What if more than 62 things need disambiguation? I can imagine such things arising in generated code.
I'd propose {<base-62-digit>}.

eddyb Nov 29, 2018

Member

Meta-nit: in a post-regex world, I find EBNF somewhat unintuitive: it took me a while to even notice that by {...} you meant "replace ? with *", initially I thought you were talking about "{" ... "}".

cc @Centril (who started using "lyg" syntax instead)

michaelwoerister Nov 29, 2018

Author

Whoops, that's just a mistake in the grammar. It should be {<base-62-digit>} indeed.

michaelwoerister Nov 29, 2018

Author

Yeah, I don't really care which notation is used :P

<generic-arguments> = "I" {<type>} "E"
<substitution> = "S" [<base-62-digit>] "_"

jsgf Nov 29, 2018

Likewise seems safer to make this {...} (notwithstanding other comments about compression).

With this post-processing in place the Punycode strings can be treated like regular identifiers and need no further special handling.


## Compression

jsgf Nov 29, 2018

I'd be very tempted to omit this kind of ad-hoc compression scheme in favour of using a standard algorithm like zstd (say, zstd with a well-defined domain-specific dictionary). (cc @Cyan4973 - are there any examples of using zstd for per-symbol compression?)

Historically C++ demanglers have been very fragile, and I suspect a big part of that is due to the implementation complexity of the Itanium ABI compression mechanism.

Aside from the implementation issues, because this is so coupled with the definition of the mangling scheme itself, it means that any future evolution of mangling needs to also take compression into account. Using a completely separate compression layer makes this a non-issue. The other nice thing about making compression largely isolated from the rest of the encoding is that it means it can be added in a second pass as an extension once we have some experience with uncompressed mangling - maybe it wouldn't be so bad?

The main problem with using an external library is that any Rust demangler introduces another dependency. This is particularly worth considering when integrating Rust demangler support into other tools like binutils/llvm/perf/valgrind/etc.

eddyb Nov 29, 2018

Member

I believe we can come up with a simple compression scheme (i.e. refer back to the byte position where the first occurrence of something was encoded).

This would allow a demangler implementation to have 0 external dependencies, and the specification would also not implicitly depend on another standard.

It might also behave better than zstd given the limitation of [a-zA-Z0-9_] (which makes bit streams less appealing), and it has the advantage that any path component name is guaranteed to show up in clear.

However, I would not be opposed to at least having a compiler mangling option which disregards the [a-zA-Z0-9_] limitation and which does the best compression it can, for use in situations where that might be advantageous (although at that point, you might be better off with symbol names just, say, hashes, and keep everything else in split debuginfo, and/or an ad-hoc mapping from hashes to symbol names).

Oh, and we should definitely gather data on all compression schemes we can think of (that are not too painful to implement) before we accept the RFC!
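
To give a feel for the "refer back to the byte position" idea, here is a toy sketch. It is not the scheme from the RFC: the `B<decimal>_` back-reference marker, the decimal position encoding, and the function name are all invented for this illustration, and it operates on a flat list of identifiers rather than on the real grammar.

```rust
use std::collections::HashMap;

/// Encodes identifiers as `<len><name>`. When an identifier repeats, a
/// back-reference `B<pos>_` is emitted instead, where `<pos>` is the byte
/// offset in the output at which the first occurrence was encoded.
/// The marker and its format are hypothetical.
fn encode_with_backrefs(components: &[&str]) -> String {
    let mut out = String::new();
    let mut first_seen: HashMap<&str, usize> = HashMap::new();
    for &name in components {
        // Copy the position out so the map can be mutated below.
        let seen = first_seen.get(name).copied();
        match seen {
            Some(pos) => out.push_str(&format!("B{}_", pos)),
            None => {
                first_seen.insert(name, out.len());
                out.push_str(&format!("{}{}", name.len(), name));
            }
        }
    }
    out
}
```

Every identifier still appears once in the clear, the output stays within `[a-zA-Z0-9_]`, and a decoder needs nothing beyond string slicing. A back-reference is not always shorter than the identifier it replaces, so a real scheme would need a rule for when to use one.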

michaelwoerister Nov 29, 2018

Author

I've listed the zstd option as point 5 in Rationale and Alternatives. I would be interested in seeing how zstd performs. But it does come with some real downsides:

  • Every demangler would have to support zstd. That's another dependency that not everyone might want to pull in.
  • The specification of the mangling scheme would depend on the specification of zstd. I see that there's an IETF RFC for it. That's good. But it's still rather heavyweight.
  • Mangled symbols would not retain any human-readability at all.

I think one of the next steps would be to collect a body of symbol names for testing different compression schemes.

michaelwoerister Nov 29, 2018

Author

@jsgf I do acknowledge, btw, that an AST-independent compression scheme is clearly beneficial when it comes to evolving the grammar.

jsgf Dec 2, 2018

@michaelwoerister Yes, I think extra dependencies and non-readability are reasonable counter-arguments to using zstd.

(waving hands wildly) I'm assuming that we wouldn't bother compressing small symbols, so they would remain directly readable, and large symbols with any compression scheme would be such a soup that even if the compression scheme leaves some parts "in the clear" they're still not directly readable in any practical sense.

### Punycode vs UTF-8
During the pre-RFC phase, it has been suggested that Unicode identifiers should be encoded as UTF-8 instead of Punycode on platforms that allow it. GCC, Clang, and MSVC seem to do this. The author of the RFC has a hard time making up their mind about this issue. Here are some interesting points that might influence the final decision:

- Using UTF-8 instead of Punycode would make mangled strings containing non-ASCII identifiers a bit more human-readable. For demangled strings, there would be no difference.

jsgf Nov 29, 2018

This is moot if compression (of any kind) is applied as well.

michaelwoerister Nov 29, 2018

Author

You mean because compressed names are unreadable in any case?

### Methods

Methods are nested within `impl` or `trait` items. As such it would be possible to construct their symbol names as paths like `my_crate::foo::{{impl}}::some_method` where `{{impl}}` somehow identifies the `impl` in question. Since `impl`s don't have names, we'd have to use an indexing scheme like the one used for closures (and indeed, this is what the compiler does internally). Adding in generic arguments too, this would lead to symbol names looking like `my_crate::foo::impl'17::<u32, char>::some_method`.

jsgf Nov 29, 2018

Given that impls can appear anywhere within the crate, would the path be to the impl itself, or to the type being impled?

Do we need to distinguish between different impls, or just impls with different constraints?

Given these questions, I think the proposal below to ignore impls themselves makes sense.

eddyb Nov 29, 2018

Member

The type being impled doesn't have to be a path, it can be e.g. [u8], so I think the safest thing to do would be to have both a path to the impl and the full type (and optionally trait) the impl is for.

michaelwoerister Nov 29, 2018

Author

The PR proposes to not include the path of the impl at all. @eddyb, you would rather demangle symbols to something like my_crate::foo::impl<u32, char>::some_method?

eddyb Dec 2, 2018

Member

I don't understand what Self and the Trait are in that example.
What I'm thinking is mangling the equivalent information of e.g. my_crate::foo::impl'17<my_crate::foo::S as my_crate::Trait>::some_method, demangling back to that only in verbose mode, but only showing <my_crate::foo::S as my_crate::Trait>::some_method in the "user-friendly" mode.

michaelwoerister Dec 7, 2018

Author

OK, yes that's what I thought you meant.

jsgf commented Nov 29, 2018

@zackw +1 on most of your points, but I think for 2. to matter it would mean that Rust compilation would have to change a lot. In practice with Rust code, one never sees linker errors for Rust symbols.

Author

michaelwoerister commented Dec 7, 2018

Thanks for the info, @tromey!

I think we'll want to stick to the index-based encoding for closures.

| "o" // u128
| "s" // i16
| "t" // u16
| "u" // ()

eddyb Dec 7, 2018

Member

Why is this special, as opposed to leaving it be TE (empty tuple)?

michaelwoerister Dec 10, 2018

Author

We could use the empty tuple encoding too, yes.

// The <decimal-number> specifies the encoding version.
<symbol-name> = "_R" [<decimal-number>] <absolute-path> [<instantiating-crate>]
<absolute-path> = "N" <path-prefix> [<generic-arguments>] "E"

eddyb Dec 7, 2018

Member

I guess these choices (N, E) and some of the basic types, happen to match the Itanium ABI?

I was thinking P might make more sense for paths, and I or J as a "closing bracket" (more so than E) but this is kind of a pointless bikeshed, since we don't expect people to read these themselves.

michaelwoerister Dec 10, 2018

Author

Yeah, I guess E makes sense as an "end" marker. Without a strong reason, I'd just stick to N and E. They are as good as any.

<path-prefix> = <identifier>
| "M" <type>
| "X" <type> <absolute-path> [<disambiguator>]

eddyb Dec 7, 2018

Member

These could use comments explaining what they look like.

Mangled names conform to the following grammar:

```
// The <decimal-number> specifies the encoding version.

eddyb Dec 7, 2018

Member

Do we need this?

I guess it would be hard to remain backwards-compatible with demanglers when we add something, given how arbitrary all the rules are, but that doesn't mean we need to specify a way to upgrade the version, unless we want to change the meaning of the existing syntax?

Do you envision some special handling when a demangler sees a digit after _R?

michaelwoerister Dec 10, 2018

Author

A demangler could quickly determine that it cannot handle a given symbol. And it could quickly switch between different demangling algorithms based on the version.

It is a more complicated problem than I thought at first, though. I think we'll always try to change as little as possible when adding something. It still means that symbols using new features can't be demangled by older demanglers.

It's still good to have some version marker in there. If there was a radical change in the encoding, we could still keep the _R prefix this way.
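
A sketch of the quick check described above, based on the draft rule `<symbol-name> = "_R" [<decimal-number>] <absolute-path> ...`. Since an `<absolute-path>` begins with a letter, any digits directly after `_R` can only be the version. The function name is hypothetical, and what a missing version number means is left open here, so it is reported as `None`.

```rust
/// Splits a symbol into its optional encoding version and the rest.
/// Returns `None` if the symbol does not start with `_R`, or if the
/// version number does not fit into a `u32`.
fn split_version(symbol: &str) -> Option<(Option<u32>, &str)> {
    let rest = symbol.strip_prefix("_R")?;
    let digits_len = rest.bytes().take_while(|b| b.is_ascii_digit()).count();
    if digits_len == 0 {
        return Some((None, rest)); // no explicit version marker
    }
    let version = rest[..digits_len].parse::<u32>().ok()?;
    Some((Some(version), &rest[digits_len..]))
}
```

A demangler can call this first and bail out early, or switch algorithms, when it sees a version it does not understand.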

m4b commented Dec 10, 2018

A general worry/comment: it doesn’t appear so, but just to clarify, there aren’t any assumptions or details about this RFC which would inhibit or otherwise cause issues with stabilizing a rust ABI in some glorious future, yes?

Stabilizing and specifying symbol mangling and hence symbol names are one piece of that puzzle, so this is a great step forward, but just want to make sure that particular perspective is at least momentarily considered in case there’s something there waiting to cause trouble in the future :)

Again I don’t think so but just wanted to bring it up explicitly.

Author

michaelwoerister commented Dec 11, 2018

> A general worry/comment: it doesn’t appear so, but just to clarify, there aren’t any assumptions or details about this RFC which would inhibit or otherwise cause issues with stabilizing a rust ABI in some glorious future, yes?

I don't think there's anything in here that blocks defining a stable ABI. It looks like some of the numeric disambiguator values will remain implementation defined for the time being but there's no inherent reason they can't be well-specified in the future.

Author

michaelwoerister commented Dec 20, 2018

Since it has become clear that we need to either encode where clauses of impls or the fully qualified path of impls, and thus path roots that start with an inherent impl cannot simply be <type> productions, I'd like to propose a small change to the grammar: All impls, inherent or trait, shall be represented by 'X' <type> [<absolute-path>] [<disambiguator>], where <type> is the self-type, the optional <absolute-path> is the trait being implemented, and <disambiguator> is used to disambiguate impls that only differ in their parameter bounds. So the new syntax for paths would be:

<absolute-path> = "N" <path-prefix> [<generic-arguments>] "E"
                | <substitution>

// A path prefix is a chain of identifiers, with one of four possible roots:
//  - <identifier> encodes a crate-id
//  - "X" (...) encodes an impl (either trait or inherent)
//  - <absolute-path> for constructing paths with generic parameters in the middle,
//    `std::foo<u32>::{closure#2}`
//  - <substitution> is a prefix that has been seen before, introduced by compression
<path-prefix> = <identifier>                                
              | "X" <type> [<absolute-path>] [<disambiguator>]
              | <absolute-path>
              | <substitution>
              | <path-prefix> <identifier>

(@eddyb: The above assumes we are encoding parameter bounds, just to keep things simple; we should still keep discussing the path-to-impl approach.)

NOTE: Why wasn't this part of the initial syntax proposal?

Merging the syntactic forms for inherent and trait impls is possible because of a new assumption. While talking to @nikomatsakis I learned that all impls are supposed to be demangled to the <SelfType as Trait> form, where for inherent impls the as Trait part is just left off. For example the to_le method of isize would be printed as <isize>::to_le instead of isize::to_le. If we assume that all impls always start with < in the demangled form, the demangler now doesn't need to know upfront which kind of impl it is dealing with.

NOTE: The node equivalence rule for compression

While experimenting with the change to the grammar, I came to agree with what I think is @rocallahan's intuition about the extended AST node equivalence definition: That it is a micro-optimization that isn't quite worth the complexity it adds.
The main problem with it is that it needs to be re-evaluated anytime the grammar changes and that it makes reasoning about substitution candidates non-local. As a consequence, I am now in favor of removing this extra rule and basing compression purely on sub-tree equality. This might "waste" substitution index space but I doubt that it will make much of a difference for total symbol name length in the common case.

Member

eddyb commented Jan 29, 2019

Various Rust / rustc-specific details this RFC should address

While implementing & testing the new symbol mangler, I found these, in no particular order:

  • rustc has more fine-grained disambiguation than type/value/closure
    • this means loss of uniqueness if disambiguator numbers are taken verbatim from rustc
    • while I disagree with the current sprawling state of DefPathData, this is more of a non-trivial long-term restructuring of the compiler and should not block this RFC
    • the extra information (e.g. distinguishing between a Trait and a Module with the same name and disambiguator) needs to be encoded in the disambiguator numbers (or separately)
  • crate names can start with decimal digits (not sure if this is intentional)
    • other than that, they're valid identifier names (after replacing - with _)
  • paths with generics on any segment (e.g. foo::<i32>::{closure})
  • forward compatibility with const generics
    • while not efficient for unsigned integers, we could adopt a length-prefixed format that demanglers can skip, or print in mangled form, if they don't know how to print constants of that type
  • polymorphic constants in array lengths (and const generics, in the future)
    • rustc can represent them even if they're broken (Waiting For Lazy Normalization)
  • more generally, various types that can't be normalized because of polymorphism
    • note that anything type/const-polymorphic in symbol names likely comes from a path prefix with unrelated generics, e.g. a monomorphic <Vec<T>>::method::<U>::STATIC
  • dyn Trait and its associated type binding syntax (e.g. dyn Iterator<Item = X>)
  • higher-ranked lifetimes(!)
    • not all lifetimes can be erased, some can be "bound" inside types, specifically:
      • for<'a, 'b> fn(...) (function pointers)
      • dyn for<'a, 'b> Trait<...> (trait objects)
    • e.g. see this TypeId test (every pair of distinct TypeIds means two separate TypeId::of instantiations and therefore two distinct symbol names)

Constrained demanglers

One thing I wanted to keep in mind is that ideally a demangler wouldn't need to work with integers larger than 64 bits to parse the mangled string.
However, this is not really a problem, since wider integers can be emulated, or parts of the mangled string (e.g. disambiguators) can be shown verbatim if they are too large.

A more notable example is the rustc-demangle crate, which is no_std and the oldest Rust stable version it supports is 1.6 (i.e. the version that stabilized #![no_std]).
Ideally, we wouldn't break compatibility to support a new mangling scheme.

The main consequence of this is having to demangle without dynamic allocation, which runs into:

  • punycode requires random-access inserts into a codepoint buffer
    • however, it's still doable for some fixed number of codepoints, by using a fixed-size [char; N] buffer on the stack
    • printing the standard punycode format would let people use some other punycode decoder in edge cases (sadly online tools tend to also want the xn-- prefix)
    • alternatives: escaping individual bytes/codepoints (e.g. urlencode), or some sort of stateful compression (e.g. delta encoding unicode codepoints, into base62)
    • if we really like punycode for its effectiveness, we could even introduce an upper bound on the state you need by chunking the original string into groups of e.g. 64 codepoints
  • AST node count backreferences require O(nodes) memory usage
    • compression is unavoidable, and far more common than non-ASCII identifiers
    • fixed buffers could work (like punycode), but failure would be less graceful
    • this is also a problem for general-purpose algorithms (zstd was brought up)
      • with possible exception of on-the-fly demangling from a sliding window
    • alternatives: refer to relative/absolute byte positions in the whole symbol

Byte position backreferences

I prefer them over the "substitutions" from Itanium C++ ABI & this RFC, as they:

  • require no dynamic allocation (but can use it if desired)
  • do not depend on having a precise AST numbering scheme

However, there are some drawbacks:

  • if you do want to allocate to track backreferences, you need a map instead of a growable array (OTOH you know the range of possible backreferences ahead of time, so that could help)
  • to avoid having to encode both start and end byte positions for each backreference, the grammar needs to be modified to have only one unambiguous end position for any given start position (for anything that a backreference can refer to)

The grammar requirements result in more or less a "prefix notation", moving as much as possible to the start, and never ending with an optional/repetition, e.g.:

// This is fine - the repetition is followed by a mandatory "E"
<type> = "T" {<type>} "E" | ...
// This is a problem - it effectively ends in {<identifier>}
// (so it's unclear at which identifier you should stop)
<path-prefix> = <path-prefix> <identifier> | ...
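
To illustrate why a single end position per start position matters, here is a small Rust sketch of a demangler for a toy grammar that follows byte-position backreferences by re-parsing from the target offset, without keeping any table of previously seen nodes. Everything in it is hypothetical: the toy grammar (`u`, `i`, `T ... E`, and `B<decimal>_`), the choice of absolute offsets from the start of the string, and the names `Parser`, `demangle_toy`, and `MAX_DEPTH`. It writes into a `String` for brevity; a `no_std` demangler would presumably write into a `fmt::Write` sink instead.

```rust
/// Toy grammar, for illustration only:
///   <type> = "u" | "i" | "T" {<type>} "E" | "B" <decimal-digits> "_"
/// A "B" node refers to the absolute byte offset of an earlier <type>.
const MAX_DEPTH: u32 = 32;

struct Parser<'a> {
    sym: &'a [u8],
    pos: usize,
    depth: u32,
}

impl<'a> Parser<'a> {
    fn next(&mut self) -> Option<u8> {
        let b = *self.sym.get(self.pos)?;
        self.pos += 1;
        Some(b)
    }

    /// Reads decimal digits terminated by "_".
    fn position(&mut self) -> Option<usize> {
        let mut value: usize = 0;
        let mut any = false;
        loop {
            match self.next()? {
                b'_' => return if any { Some(value) } else { None },
                b @ b'0'..=b'9' => {
                    value = value.checked_mul(10)?.checked_add((b - b'0') as usize)?;
                    any = true;
                }
                _ => return None,
            }
        }
    }

    fn print_type(&mut self, out: &mut String) -> Option<()> {
        // Guard against backreferences that point into an enclosing node.
        if self.depth >= MAX_DEPTH {
            return None;
        }
        self.depth += 1;
        let start = self.pos;
        match self.next()? {
            b'u' => out.push_str("u32"),
            b'i' => out.push_str("i32"),
            b'T' => {
                out.push('(');
                let mut first = true;
                // The mandatory "E" gives the tuple exactly one end position.
                while *self.sym.get(self.pos)? != b'E' {
                    if !first {
                        out.push_str(", ");
                    }
                    first = false;
                    self.print_type(out)?;
                }
                self.pos += 1; // consume "E"
                out.push(')');
            }
            b'B' => {
                let target = self.position()?;
                if target >= start {
                    return None; // only backwards references are allowed
                }
                // Re-parse at the target, then resume after the backref.
                let resume = self.pos;
                self.pos = target;
                self.print_type(out)?;
                self.pos = resume;
            }
            _ => return None,
        }
        self.depth -= 1;
        Some(())
    }
}

fn demangle_toy(sym: &str) -> Option<String> {
    let mut p = Parser { sym: sym.as_bytes(), pos: 0, depth: 0 };
    let mut out = String::new();
    p.print_type(&mut out)?;
    if p.pos == sym.len() { Some(out) } else { None }
}
```

For example, in `TTuiEB1_E` the backreference `B1_` points at offset 1, where the inner tuple `TuiE` starts, so the parser knows where that node ends without a stored end position.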

Grammar

Keeping in mind everything I've mentioned above, and trying to not change too much from this RFC (especially the more arbitrary choices), this is what I ended up with:

// The <decimal-number> specifies the encoding version.
<symbol-name> = "_R" [<decimal-number>] <path> [<instantiating-crate>]

<path> = "C" <identifier>               // crate root
       | "M" <impl-path> <type>         // <T> (inherent impl)
       | "X" <impl-path> <type> <path>  // <T as Trait> (trait impl)
       | "Y" <type> <path>              // <T as Trait> (trait definition)
       | "N" <ns> <path> <identifier>   // ...::ident (nested path)
       | "I" <path> {<generic-arg>} "E" // ...<T, U> (generic args)
       | <backref>

// Path to an impl (without the Self type or the trait).
// The <path> is the parent, while the <disambiguator> distinguishes
// between impls in that same parent (e.g. multiple impls in a mod).
// This exists as a simple way of ensuring uniqueness, and demanglers
// don't need to show it (unless the location of the impl is desired).
<impl-path> = [<disambiguator>] <path>

// The <decimal-number> is the length of the identifier in bytes.
// <bytes> is the identifier itself and must not start with a decimal digit.
// If the "u" is present then <bytes> is Punycode-encoded.
<identifier> = [<disambiguator>] <undisambiguated-identifier>
<disambiguator> = "s" [<base-62-number>] "_"
<undisambiguated-identifier> = ["u"] <decimal-number> <bytes>

// Namespace of the identifier in a (nested) path.
// It's an a-zA-Z character, with a-z reserved for implementation-internal
// disambiguation categories (and demanglers should never show them), while
// A-Z are used for special namespaces (e.g. closures), which the demangler
// can show in a special way (e.g. `NC...` as `...::{closure}`), or just
// default to showing the uppercase character.
<ns> = "C"      // closure
     | "S"      // shim
     | <A-Z>    // other special namespaces
     | <a-z>    // internal namespaces

<generic-arg> = <lifetime>
              | <type>
              | "K" <const> // forward-compat for const generics

// An anonymous (numbered) lifetime, either erased or higher-ranked.
// Index 0 is always erased (can show as '_, if at all), while indices
// starting from 1 refer (as de Bruijn indices) to a higher-ranked
// lifetime bound by one of the enclosing <binder>s.
<lifetime> = "L" <base-62-number>

// Specify the number of higher-ranked (for<...>) lifetimes to bind.
// <lifetime> can then later refer to them, with lowest indices for
// innermost lifetimes, e.g. in `for<'a, 'b> fn(for<'c> fn(...))`,
// any <lifetime>s in ... (but not inside more binders) will observe
// the indices 1, 2, and 3 refer to 'c, 'b, and 'a, respectively.
<binder> = "G" <base-62-number>

<type> = <basic-type>
       | <path>                     // named type
       | "A" <type> <const>         // [T; N]
       | "S" <type>                 // [T]
       | "T" {<type>} "E"           // (T1, T2, T3, ...)
       | "R" [<lifetime>] <type>    // &T
       | "Q" [<lifetime>] <type>    // &mut T
       | "P" <type>                 // *const T
       | "O" <type>                 // *mut T
       | "F" <fn-sig>               // fn(...) -> ...
       | "D" <dyn-bounds> <lifetime> // dyn Trait<Assoc = X> + Send + 'a
       | <backref>

<basic-type> =      // original <basic-type> from the RFC, plus:
             | "p"  // placeholder (e.g. for generic params), shown as _

// If the "U" is present then the function is `unsafe`.
// The return type is always present, but demanglers can
// choose to omit the ` -> ()` by special-casing "u".
<fn-sig> = <binder> ["U"] ["K" <abi>] {<type>} "E" <type>

<abi> = "C"
      | <undisambiguated-identifier>

<dyn-bounds> = <binder> {<dyn-trait>} "E"
<dyn-trait> = <path> {<dyn-trait-assoc-binding>}
<dyn-trait-assoc-binding> = "p" <undisambiguated-identifier> <type>
<const> = <type> <const-data>
        | <type> "p" // placeholder (e.g. for polymorphic constants), shown as _: T
        | <backref>

// The encoding of a constant depends on its type, currently only
// unsigned integers (mainly usize, for arrays) are supported, and they
// use their value, in base 16 (0-9a-f), not their memory representation.
//
// Note that while exposing target-specific data layout information, such
// as pointer size, endianness, etc. should be avoided as much as possible,
// it might become necessary to include raw bytes, even whole allocation
// subgraphs (that miri created), for const generics with non-trivial types.
//
// However, demanglers could just show the raw encoding without trying to
// turn it into expressions, unless they're part of e.g. a debugger, with
// more information about the target data layout and/or from debuginfo.
<const-data> = {<hex-digit>} "_"

// <base-62-number> uses 0-9, a-z, A-Z as digits, i.e. 'a' is decimal 10 and
// 'Z' is decimal 61.
// "_" with no digits indicates the value 0, while any other value is offset
// by 1, e.g. "0_" is 1, "Z_" is 62, "10_" is 63, etc.
<base-62-number> = {<0-9a-zA-Z>} "_"

<backref> = "B" <base-62-number>

// We use <path> here, so that we don't have to add a special rule for
// compression. In practice, only a crate root is expected.
<instantiating-crate> = <path>
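
The offset-by-one rule for `<base-62-number>` is easy to get wrong, so here is a minimal Rust sketch of an encoder and decoder for it. The function names `encode_base62` and `decode_base62` are hypothetical, and the use of `u64` (returning `None` on overflow) is an assumption in the spirit of the 64-bit remark above, not something the grammar mandates.

```rust
/// Digits in the order given by the grammar comment: 0-9, a-z, A-Z.
const DIGITS: &[u8] = b"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

/// Encodes `value` as a <base-62-number>: 0 is "_", anything else is
/// the base-62 digits of `value - 1` followed by "_".
fn encode_base62(value: u64) -> String {
    let mut out = Vec::new();
    if value > 0 {
        let mut n = value - 1;
        loop {
            out.push(DIGITS[(n % 62) as usize]);
            n /= 62;
            if n == 0 {
                break;
            }
        }
        out.reverse(); // digits were produced least significant first
    }
    out.push(b'_');
    String::from_utf8(out).expect("all digits are ASCII")
}

/// Decodes a <base-62-number> from the start of `s`, returning the value
/// and the remaining input, or `None` on malformed input or overflow.
fn decode_base62(s: &str) -> Option<(u64, &str)> {
    let mut value: u64 = 0;
    for (i, b) in s.bytes().enumerate() {
        let digit = match b {
            b'_' => {
                // A bare "_" is 0; otherwise undo the offset of 1.
                let n = if i == 0 { 0 } else { value.checked_add(1)? };
                return Some((n, &s[i + 1..]));
            }
            b'0'..=b'9' => b - b'0',
            b'a'..=b'z' => b - b'a' + 10,
            b'A'..=b'Z' => b - b'A' + 36,
            _ => return None,
        };
        value = value.checked_mul(62)?.checked_add(u64::from(digit))?;
    }
    None // ran out of input before the terminating "_"
}
```

The values from the grammar comment round-trip as expected: `_` is 0, `0_` is 1, `Z_` is 62, and `10_` is 63.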

Stats

A complete mangler & demangler implementation means I can dump a lot of symbol names to compare several approaches, so I've done that for building libstd, rustc and Cargo (1.3GB of dumps).

This comment is getting long so I'm not going to embed the generated table, but you can find it here: https://gist.github.com/eddyb/786598131525ef5adc9189a30e31c2fc

Column explanation:

  • "old" is how Rust currently mangles symbols
    • "old+generics" is hackily printing type information into that
  • "mw" is @michaelwoerister's reference implementation
    • "mw+compress" has compression enabled, "mw" doesn't
  • "new" is my version of the mangling scheme (with the above grammar)
    • "new+compress" has compression enabled, "new" doesn't
  • the "▶" columns are for compression ratios between adjacent columns
    • e.g. how much smaller "new+compress" is than "new"
    • ratios are computed per-symbol, then combined into min/average/max
    • total ratio is between the total sizes (which is why it can drift from average)

Conclusion

I was expecting my "prefix notation" grammar to produce slightly larger symbols than @michaelwoerister's implementation. Instead, I'm surprised to see it's somewhat shorter.
I'll be posting rustc and rustc-demangle PRs shortly.
EDIT: mangler PR: rust-lang/rust#57967, demangler PR: alexcrichton/rustc-demangle#23

bors added a commit to rust-lang/rust that referenced this pull request Jan 29, 2019

Auto merge of #57967 - eddyb:rmangle, r=<try>
Introduce Rust symbol mangling scheme.

This is an implementation of a "feature-complete" Rust mangling scheme, in the vein of rust-lang/rfcs#2603 - but with some differences, see rust-lang/rfcs#2603 (comment) for details.

The demangling implementation PR is alexcrichton/rustc-demangle#23
(this PR already uses it via a git dependency, to allow testing).

Discussion of the *design* of the mangling scheme should still happen on the RFC, but this PR's specific implementation details can be reviewed in parallel.

<hr/>

*Notes for reviewers*:
* only the last 4 commits are specific to this branch, if necessary I can open a separate PR for everything else (it was meant to be its own small refactoring, but it got a bit out of hand)
* the "TEMPORARY" commit is only there because it does some extra validation (comparing the demangling from `rustc-demangle` to the compiler's pretty-printing, adjusted slightly to produce the same output), that I would like to try on crater
* there is the question of whether we should turn on the new mangling now, wait for tools to support it (I'm working on that), and/or have it under a `-Z` flag for now

r? @nikomatsakis / @michaelwoerister cc @rust-lang/compiler

# Summary
[summary]: #summary

This RFC proposes a new mangling scheme that describes what the symbol names generated by the Rust compiler. This new scheme has a number of advantages over the existing one which has grown over time without a clear direction. The new scheme is consistent, does not depend on compiler internals, and the information it stores in symbol names can be decoded again which provides an improved experience for users of external tools that work with Rust symbol names. The new scheme is based on the name mangling scheme from the [Itanium C++ ABI][itanium-mangling].

shepmaster Jan 30, 2019

Member

> describes what the symbol names generated by the Rust compiler

I think you lost a word here.

(Meta note — if you hard-wrap your text to a certain number of columns, GH comments can be more specifically targeted)

michaelwoerister Jan 30, 2019

Author

(Meta note — if you hard-wrap your text to a certain number of columns, GH comments can be more specifically targeted)

👍

zackw Jan 30, 2019

Contributor

It's even better to put hard returns after each sentence and, for long sentences, each clause, but it takes some getting used to when you're writing it.


The function `foo` lives in the value namespace while the module `foo` lives in the type namespace. They don't interfere. In order to make the symbol names for the two distinct `bar` functions unique, we thus add a suffix to name components in the value namespace, so case one would get the symbol name `N15mycrate_4a3b56d3fooV3barVE` and case two would get the name `N15mycrate_4a3b56d3foo3barVE` (notice the difference: `3fooV` vs `3foo`).

There is on final case of name ambiguity that we have to take care of. Because of macro hygiene, multiple items with the same name can appear in the same context. The compiler internally disambiguates such names by augmenting them with a numeric index. For example, the first occurrence of the name `foo` within its parent is actually treated as `foo'0`, the second occurrence would be `foo'1`, the next `foo'2`, and so one. The mangling scheme will adopt this setup by appending a disambiguation suffix to each identifier with a non-zero index. So if macro expansion would result in the following code:

shepmaster Jan 30, 2019

Member
Suggested change
- There is on final case of name ambiguity that we have to take care of. Because of macro hygiene, multiple items with the same name can appear in the same context. The compiler internally disambiguates such names by augmenting them with a numeric index. For example, the first occurrence of the name `foo` within its parent is actually treated as `foo'0`, the second occurrence would be `foo'1`, the next `foo'2`, and so one. The mangling scheme will adopt this setup by appending a disambiguation suffix to each identifier with a non-zero index. So if macro expansion would result in the following code:
+ There is one final case of name ambiguity that we have to take care of. Because of macro hygiene, multiple items with the same name can appear in the same context. The compiler internally disambiguates such names by augmenting them with a numeric index. For example, the first occurrence of the name `foo` within its parent is actually treated as `foo'0`, the second occurrence would be `foo'1`, the next `foo'2`, and so one. The mangling scheme will adopt this setup by appending a disambiguation suffix to each identifier with a non-zero index. So if macro expansion would result in the following code:

### Unicode Identifiers

Rust allows Unicode identifiers but our character set is restricted to ASCII alphanumerics, and `_`. In order to transcode the former to the latter, we use the same approach as Swift, which is: encode all non-ascii identifiers via [Punycode][punycode], a standardized and efficient encoding that keeps encoded strings in a rather human-readable format. So for example, the string

shepmaster Jan 30, 2019

Member
Suggested change
- Rust allows Unicode identifiers but our character set is restricted to ASCII alphanumerics, and `_`. In order to transcode the former to the latter, we use the same approach as Swift, which is: encode all non-ascii identifiers via [Punycode][punycode], a standardized and efficient encoding that keeps encoded strings in a rather human-readable format. So for example, the string
+ Rust allows Unicode identifiers but our character set is restricted to ASCII alphanumerics, and `_`. In order to transcode the former to the latter, we use the same approach as Swift, which is: encode all non-ASCII identifiers via [Punycode][punycode], a standardized and efficient encoding that keeps encoded strings in a rather human-readable format. So for example, the string

shepmaster Jan 30, 2019

Member

> via Punycode

Punycode is defined in such a way that no two Unicode strings map to the same output, right?

What would these two map to?

fn gödel() {}
fn göödel() {}

eddyb Jan 30, 2019

Member

> Punycode is defined in such a way that no two Unicode strings map to the same output, right?

I'm not sure what you're asking ("no two" distinct "Unicode strings"?). But for your example, this is what my implementation produces (in a crate named foo):

  • gödel: gdel_Fqa (_RNnCsbDqzXfLQacH_3foou8gdel_Fqa)
  • göödel: gdel_Fqaa (_RNnCsbDqzXfLQacH_3foou9gdel_Fqaa)

shepmaster Feb 1, 2019

Member

Mostly I was (am?) confused by:

"Gödel, Escher, Bach" is encoded as "Gdel, Escher, Bach-d3b"

Because that looks like "Gödel" is encoded as "Gdel" (simply removing those characters) and thus "Göödel" would also be encoded as "Gdel". Based on your output, I see two things:

  1. The -d3b applies to the whole string, not just "Bach".
  2. This extra "stuff" (pardon my very technical jargon) around the string helps disambiguate things that would otherwise collide.

eddyb Feb 2, 2019

Member

Everything after the last - is a set of base36-encoded compressed instructions to insert arbitrary codepoints at arbitrary positions in the string before. (and if there's no ASCII initial string, the - is also missing)

There's no notion of "collision" as this is a perfectly lossless encoding, and it can't even use fewer than one base36 (ASCII) character to encode a Unicode character.

In my example, the second a in Fqaa is the most compressed encoding in punycode (it's 0 in their base36): "insert the same codepoint as the last one, just after the last insert position".

You can test this out on online punycode decoders, just keep in mind most of them also want a xn-- prefix (which is what DNS uses).

EDIT: oh and you'll also need to remap A-J to 0-9 because the RFC doesn't want the punycode base36 part to use any decimal digits (I'm not sure why, actually, it doesn't seem relevant to the grammar).
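
For readers who want to check the two examples by hand, here is a minimal Rust sketch of a decoder for the variant described in this thread. The algorithm and its parameters (base 36, tmin 1, tmax 26, skew 38, damp 700, initial bias 72, initial n 128) are the standard ones from RFC 3492. The two deviations are assumptions read off the comments above: `_` is used as the delimiter, and `A`-`J` stand in for the digits `0`-`9` (values 26 to 35). The name `decode_punycode_variant` is hypothetical.

```rust
const BASE: u32 = 36;
const T_MIN: u32 = 1;
const T_MAX: u32 = 26;
const SKEW: u32 = 38;
const DAMP: u32 = 700;
const INITIAL_BIAS: u32 = 72;
const INITIAL_N: u32 = 0x80;

/// Bias adaptation function from RFC 3492, section 6.1.
fn adapt(mut delta: u32, num_points: u32, first_time: bool) -> u32 {
    delta /= if first_time { DAMP } else { 2 };
    delta += delta / num_points;
    let mut k = 0;
    while delta > ((BASE - T_MIN) * T_MAX) / 2 {
        delta /= BASE - T_MIN;
        k += BASE;
    }
    k + (BASE - T_MIN + 1) * delta / (delta + SKEW)
}

fn decode_punycode_variant(input: &str) -> Option<String> {
    // Everything before the last "_" is the literal ASCII part; with no
    // "_" at all, the whole input is encoded insert instructions.
    let (basic, encoded) = match input.rfind('_') {
        Some(pos) => (&input[..pos], &input[pos + 1..]),
        None => ("", input),
    };
    if !basic.is_ascii() {
        return None;
    }
    let mut output: Vec<char> = basic.chars().collect();
    let mut n = INITIAL_N;
    let mut i: u32 = 0;
    let mut bias = INITIAL_BIAS;
    let mut digits = encoded.bytes().peekable();

    while digits.peek().is_some() {
        // Decode one variable-length integer into `i`.
        let old_i = i;
        let mut w: u32 = 1;
        let mut k = BASE;
        loop {
            let digit = match digits.next()? {
                b @ b'a'..=b'z' => u32::from(b - b'a'),
                b @ b'A'..=b'J' => u32::from(b - b'A') + 26, // assumed remap of 0-9
                _ => return None,
            };
            i = i.checked_add(digit.checked_mul(w)?)?;
            let t = if k <= bias {
                T_MIN
            } else if k >= bias + T_MAX {
                T_MAX
            } else {
                k - bias
            };
            if digit < t {
                break;
            }
            w = w.checked_mul(BASE - t)?;
            k += BASE;
        }
        // `i` now encodes both the codepoint increment and the position.
        let len = output.len() as u32 + 1;
        bias = adapt(i - old_i, len, old_i == 0);
        n = n.checked_add(i / len)?;
        i %= len;
        output.insert(i as usize, char::from_u32(n)?);
        i += 1;
    }
    Some(output.into_iter().collect())
}
```

Note that this sketch allocates a `Vec<char>`; the fixed-size `[char; N]` buffer mentioned earlier in the thread would replace it in an allocation-free demangler.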

CAD97 Feb 3, 2019

How do you tell gödel encoded as gdel_Fqa apart from a literal gdel_Fqa? For demangling I guess it doesn't matter if you only have one and gdel_Fqa shows up as gödel, but when mangling this would introduce a collision point, wouldn't it?

> To prevent non-international domain names containing hyphens from being accidentally interpreted as Punycode, international domain name Punycode sequences have a so-called ASCII Compatible Encoding (ACE) prefix, "xn--", prepended. [source]

This makes me think that the symbols will collide without some way to tell apart a punycoded name versus an un-punycoded name. (This could just be an extra trailing _ on everything to represent an empty punycode section.)

eddyb Feb 3, 2019

Member

The u flag indicates whether punycode encoding is being used.

EDIT: in my examples above, the u is before the length of the encoded identifier, i.e. u8gdel_Fqa and u9gdel_Fqaa.
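
As a small sketch of how the `u` flag removes the ambiguity raised above, the following Rust function splits an `<undisambiguated-identifier>` into its flag, its bytes, and the remaining input, following the `["u"] <decimal-number> <bytes>` rule from the grammar posted earlier in this thread. The function name and return shape are hypothetical.

```rust
/// Parses `["u"] <decimal-number> <bytes>` from the start of `input`.
/// Returns (is_punycode, identifier bytes as written, remaining input).
fn parse_undisambiguated_identifier(input: &str) -> Option<(bool, &str, &str)> {
    // The optional "u" flag comes first and marks a Punycode identifier.
    let (is_punycode, rest) = match input.strip_prefix('u') {
        Some(r) => (true, r),
        None => (false, input),
    };
    // Then the length of the identifier in bytes, as decimal digits.
    let digits = rest.bytes().take_while(|b| b.is_ascii_digit()).count();
    if digits == 0 {
        return None;
    }
    let len: usize = rest[..digits].parse().ok()?;
    let after = &rest[digits..];
    if len > after.len() || !after.is_char_boundary(len) {
        return None;
    }
    let (ident, tail) = after.split_at(len);
    Some((is_punycode, ident, tail))
}
```

A literal identifier `gdel_Fqa` would be written as `8gdel_Fqa`, while the Punycode form of `gödel` is `u8gdel_Fqa`, so the two cannot collide.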


### Compression/Substitution

The length of symbol names has an influence on how much work compiler, linker, and loader have to perform. The shorter the names, the better. At the same time, Rust's generics can lead to rather long names (which are often not visible in the code because of type inference and `impl Trait`). For example, the return type of the following function:

shepmaster Jan 30, 2019

Member
Suggested change
- The length of symbol names has an influence on how much work compiler, linker, and loader have to perform. The shorter the names, the better. At the same time, Rust's generics can lead to rather long names (which are often not visible in the code because of type inference and `impl Trait`). For example, the return type of the following function:
+ The length of symbol names has an influence on how much work the compiler, linker, and loader have to perform. The shorter the names, the better. At the same time, Rust's generics can lead to rather long names (which are often not visible in the code because of type inference and `impl Trait`). For example, the return type of the following function:

```rust
fn quux(s: Vec<u32>) -> impl Iterator<Item=(u32, usize)> {
```

shepmaster Jan 30, 2019

Member
Suggested change
- fn quux(s: Vec<u32>) -> impl Iterator<Item=(u32, usize)> {
+ fn quux(s: Vec<u32>) -> impl Iterator<Item = (u32, usize)> {

eddyb Jan 30, 2019

Member

This is an interesting point, in trait objects, the compiler (and demangler I wrote) still print without spaces around the =, e.g. dyn Iterator<Item=String>.
Should we change it?

Centril Jan 30, 2019

Contributor

> Should we change it?

Yes. Try rustfmt on: fn foo() -> impl Iterator<Item=u8> { 0..10 }

eddyb Feb 4, 2019

Member

I've applied this change to the implementation PR.

```rust
fn quux(s: Vec<u32>) -> impl Iterator<Item=(u32, usize)> {
    s.into_iter()
     .map(|x| x+1)
```

shepmaster Jan 30, 2019

Member
Suggested change
- .map(|x| x+1)
+ .map(|x| x + 1)

```
std::iter::Once<(u32, usize)>>
```

It would make for a long symbol name if this types is used (maybe repeatedly) as a generic argument somewhere. C++ has the same problem with its templates; which is why the Itanium mangling introduces the concept of compression. If a component of a definition occurs more than once, it will not be repeated and instead be emitted as a substitution marker that allows to reconstruct which component it refers to. The scheme proposed here will use the same approach.

shepmaster Jan 30, 2019

Member
Suggested change
- It would make for a long symbol name if this types is used (maybe repeatedly) as a generic argument somewhere. C++ has the same problem with its templates; which is why the Itanium mangling introduces the concept of compression. If a component of a definition occurs more than once, it will not be repeated and instead be emitted as a substitution marker that allows to reconstruct which component it refers to. The scheme proposed here will use the same approach.
+ It would make for a long symbol name if this type is used (maybe repeatedly) as a generic argument somewhere. C++ has the same problem with its templates; which is why the Itanium mangling introduces the concept of compression. If a component of a definition occurs more than once, it will not be repeated and instead be emitted as a substitution marker that allows to reconstruct which component it refers to. The scheme proposed here will use the same approach.

The exact scheme will be described in detail in the reference level explanation below but it roughly works as follows: As a mangled symbol name is being built or parsed, we build up a dictionary of "substitutions", that is we keep track of things a subsequent occurrence of which could be replaced by a substitution marker. The substitution marker is then the lookup key into this dictionary. The things that are eligible for substitution are (1) all prefixes of absolute paths (including the entire path itself) and (2) all types except for basic types. If a substitutable item is already present in the dictionary it does not generate a new key. Here's an example in order to illustrate the concept:

shepmaster Jan 30, 2019

Member
Suggested change
The exact scheme will be described in detail in the reference level explanation below but it roughly works as follows: As a mangled symbol name is being built or parsed, we build up a dictionary of "substitutions", that is we keep track of things a subsequent occurrence of which could be replaced by a substitution marker. The substitution marker is then the lookup key into this dictionary. The things that are eligible for substitution are (1) all prefixes of absolute paths (including the entire path itself) and (2) all types except for basic types. If a substitutable item is already present in the dictionary it does not generate a new key. Here's an example in order to illustrate the concept:
The exact scheme will be described in detail in the reference level explanation below but it roughly works as follows: As a mangled symbol name is being built or parsed, we build up a dictionary of "substitutions", that is, we keep track of things a subsequent occurrence of which could be replaced by a substitution marker. The substitution marker is then the lookup key into this dictionary. The things that are eligible for substitution are (1) all prefixes of absolute paths (including the entire path itself) and (2) all types except for basic types. If a substitutable item is already present in the dictionary it does not generate a new key. Here's an example in order to illustrate the concept:
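As a rough illustration of the dictionary idea in the paragraph above, here is a minimal Rust sketch. It is not the RFC's encoding: the `Dictionary` type, the `emit` method, and the `S<index>_` marker syntax are all assumptions made for illustration, and substitutable items are compared as plain strings.

```rust
/// Hypothetical substitution dictionary; not the RFC's actual encoding.
struct Dictionary {
    entries: Vec<String>,
}

impl Dictionary {
    fn new() -> Self {
        Dictionary { entries: Vec::new() }
    }

    /// Returns the text to emit for `item`: the item itself on first use,
    /// or a marker (assumed syntax `S<index>_`) on any later use.
    /// An item that is already present does not get a new key.
    fn emit(&mut self, item: &str) -> String {
        match self.entries.iter().position(|e| e == item) {
            Some(index) => format!("S{}_", index),
            None => {
                self.entries.push(item.to_string());
                item.to_string()
            }
        }
    }
}
```

Calling `emit` twice with the same item returns the item the first time and a marker the second time. A decoder would build the same table while parsing, so the marker's index is enough to recover the original item.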
@michaelwoerister

Author

michaelwoerister commented Feb 1, 2019

@eddyb I've taken a look at the grammar you are proposing and I really like the general approach! I need to actually work with it a little in order to get a feel for it and see if I can find issues that need clarifying, but I suspect it's already close to what we want. Thanks for all the work you've put into this!

@matthieu-m


matthieu-m commented Feb 16, 2019

It is not clear to me from the RFC, so apologies if I am flogging a dead horse...

I would advise NOT to encode generic arguments when their default value is used:

  • it saves considerable space.
  • it makes them much easier to read in debuggers.

As an example of what happens when the default value of generic arguments is encoded, consider this extreme C++ example (godbolt link):

// What the developer writes:
#include <unordered_map>

std::size_t foo(std::unordered_map<int, int>& um) {
    return um.size();
}

// The demangled symbol produced:
foo(std::unordered_map<int, int, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, int> > >&)

// The ideal demangled symbol produced:
foo(std::unordered_map<int, int>&)

The mangled/demangled symbol contains a lot of unnecessary information which bloats the binary, and wastes valuable display space/drowns out useful information when displayed.

I could see an argument for mangling the arguments anyway (backward compatibility in the face of a changing source code); and in this case would recommend some kind of marker that the value encoded is the default at the time of mangling so that it can be elided by debuggers (by default) when displaying the demangled symbol.

Note: with custom allocators coming to Rust, this is the difference between std::vec::Vec::<i32> and std::vec::Vec::<i32, <std::alloc::Allocator as std::alloc::Alloc>>.

@eddyb

Member

eddyb commented Feb 17, 2019

I would advise NOT to encode generic arguments when their default value is used:

My apologies for not bringing that up. During the implementation, I made it impossible to even observe generic arguments identical to their defaults (previously, -Zverbose would show them), as it simplified the compiler-internal interface.

I definitely agree there is no value in them, for a similar reason to why we don't encode e.g. the field types of a struct.

For an example of a use of default parameters today, see HashMap.
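To illustrate the `HashMap` case: its third type parameter (the hasher state) defaults to `RandomState`, so spelling the default out names the same type as omitting it. The sketch below checks this with `TypeId`; the helper name `default_is_same_type` is made up for this example, and nothing here says how the compiler mangles either spelling.

```rust
use std::any::TypeId;
use std::collections::hash_map::RandomState;
use std::collections::HashMap;

/// Hypothetical helper: compares the type written with and without
/// its defaulted hasher parameter.
fn default_is_same_type() -> bool {
    TypeId::of::<HashMap<i32, i32>>() == TypeId::of::<HashMap<i32, i32, RandomState>>()
}
```

Because both spellings are one type, a symbol name that omits defaulted arguments still identifies it, provided the demangler (or the reader) knows the default.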

@michaelwoerister

Author

michaelwoerister commented Feb 21, 2019

A little status update here: I've taken some time to review @eddyb's proposed grammar (by writing a parser and demangler for it). And I really like it:

  • the syntax is very regular and consistent
  • compression and decompression are really simple
  • it is known to cover all the cases the compiler currently emits.

As a consequence, I plan to update the RFC over the next few days, switching to @eddyb's grammar. While doing that I'll also address the various comments and suggestions made above.

@michaelwoerister

Author

michaelwoerister commented Feb 26, 2019

I have just pushed an updated version of the RFC that contains the grammar as proposed by @eddyb. I think I also addressed all nits and unresolved comments. Let me know if I forgot anything. There's a changelog at the end of the text.

Unresolved questions are:

  • Is the handling of constant data sufficient for now?
  • Should we allow emitting UTF-8 instead of Punycode? (I'm inclined to mandate support for Punycode but leave it to the compiler to decide when to use it).
  • Should we emit parameter types for function and method symbols as an additional safeguard against silent linker errors?

I did not follow up on zstd compression as we don't have any data on it and, while I find it intriguing, it has too many downsides in my opinion.

||||+------------------------- namespace tag for "foo"
|||+-------------------------- start-tag for "foo"
||+--------------------------- namespace tag for "bar"
|+---------------------------- start-tag for "bar"


@eddyb

eddyb Mar 15, 2019

Member

I would maybe describe these slightly differently, i.e. treating the nesting as if

mycrate::foo::bar

was really

((mycrate)::foo)::bar

which with the extra information becomes:

N('v', None, N('t', None, C(Some(1234), "mycrate"), "foo"), "bar")
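One way to read the nested expression above is as a small recursive data type. The Rust sketch below is only an interpretation: the `Path` enum, its variant names, and the meaning assigned to each field (namespace tag, optional disambiguator, parent path, identifier) are assumptions, not definitions taken from the RFC.

```rust
/// Sketch of the nested-path shape; field meanings are assumptions.
enum Path {
    /// `C(disambiguator, name)`: a crate root.
    Crate(Option<u64>, &'static str),
    /// `N(namespace, disambiguator, parent, name)`: a path nested in `parent`.
    Nested(char, Option<u64>, Box<Path>, &'static str),
}

impl Path {
    /// Renders the path as `a::b::c`, ignoring namespaces and disambiguators.
    fn render(&self) -> String {
        match self {
            Path::Crate(_, name) => name.to_string(),
            Path::Nested(_, _, parent, name) => format!("{}::{}", parent.render(), name),
        }
    }
}

/// Builds the example from the comment: ((mycrate)::foo)::bar.
fn example() -> Path {
    Path::Nested(
        'v',
        None,
        Box::new(Path::Nested(
            't',
            None,
            Box::new(Path::Crate(Some(1234), "mycrate")),
            "foo",
        )),
        "bar",
    )
}
```

Rendering `example()` walks from the outermost `N` inward, which mirrors the `((mycrate)::foo)::bar` grouping: each nested path wraps its prefix rather than listing components side by side.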
<path> = C <identifier> // crate-id root
| M <impl-path> <type> // inherent impl root
| X <impl-path> <type> <path> // trait impl root
| N <namespace> <path> <identifier> // nested path


@eddyb

eddyb Mar 15, 2019

Member

Indentation of comments here seems off.

with its templates; which is why the Itanium mangling introduces the
concept of compression. If a component of a definition occurs more than
once, it will not be repeated and instead be emitted as a substitution
marker that allows to reconstruct which component it refers to. The


@eddyb

eddyb Mar 15, 2019

Member

Is "substitution" still applicable, or should we go with "backreference" everywhere?

```
The things that are eligible for substitution are (1) all prefixes of
paths (including the entire path itself), (2) all types except for
basic types, and (3) instances of const data.


@eddyb

eddyb Mar 15, 2019

Member

s/instances of const data/type-level constants (array lengths and values passed to const generic params)

cc @varkor for the exact naming here


```
The things that are eligible for substitution are (1) all prefixes of
paths (including the entire path itself), (2) all types except for


@eddyb

eddyb Mar 15, 2019

Member

For (1) I'd say something like "all paths (including nested paths / path prefixes)".

7---- 7---- 7----
5----------- 45---------
43--------------------
42-----------------------


@eddyb

eddyb Mar 15, 2019

Member

IIRC, my actual compression scheme uses 0 for the first byte after _R - in fact, this entire document could talk about the mangling as "the data after _R", and only specify in one place that it's encapsulated in "_R" <data> "" (_R prefix and no suffix).

Parsing, decompression, and demangling can thus be done in a single pass over the mangled name without the need to do dynamic allocation except for the dictionary array.
Parsing, decompression, and demangling can thus be done in a single pass
over the mangled name without the need for complex data structures, which
is useful when having to implement `#[no_std]` or C demanglers.


@eddyb

eddyb Mar 15, 2019

Member

This seems to be missing a note about punycode decoding.
(ignore details below if you've read my other descriptions of it)

It requires performing arbitrary-position insertions within a string buffer, driven by instructions decoded from a compressed "instruction stream", with no indication of the final size (no matter which encoding is picked).
Also, those instructions use "unicode scalar value" (i.e. Rust char) indexing, so something like Vec<char> needs to be used, and later shrunk to a String.

One can implement bounded punycode decoding on the stack (i.e. with [char; MAX_CHARS]), which gives up and prints the punycode encoding (or does something else) if the number of characters is larger than the constant bound (128 in my demangler implementation), but there are other workarounds possible (such as specifying a "chunked punycode" system, where the chars are split into groups of a fixed size, and are each compressed separately).
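The bounded approach described above could look roughly like the following Rust sketch. It decodes standard RFC 3492 Punycode (with `-` as the delimiter and lowercase digits only) into a `[char; MAX_CHARS]` buffer and gives up once the bound is hit. The function name `decode_bounded` is made up for this example, and the delimiter and digit alphabet used by the mangling scheme may differ from what is assumed here.

```rust
/// Upper bound on decoded characters (128 is the bound mentioned above).
const MAX_CHARS: usize = 128;

/// Decodes standard RFC 3492 Punycode into a caller-provided stack buffer.
/// Returns the number of chars written, or `None` if the input is malformed
/// or would need more than `MAX_CHARS` characters.
/// Assumption: lowercase digits only, '-' as the delimiter.
fn decode_bounded(input: &str, out: &mut [char; MAX_CHARS]) -> Option<usize> {
    const BASE: u32 = 36;
    const T_MIN: u32 = 1;
    const T_MAX: u32 = 26;
    const SKEW: u32 = 38;
    const DAMP: u32 = 700;

    // Everything before the last '-' is copied verbatim (basic code points).
    let (basic, encoded) = match input.rfind('-') {
        Some(pos) => (&input[..pos], &input[pos + 1..]),
        None => ("", input),
    };
    let mut len = 0usize;
    for c in basic.chars() {
        if !c.is_ascii() || len == MAX_CHARS {
            return None;
        }
        out[len] = c;
        len += 1;
    }

    let (mut n, mut i, mut bias) = (128u32, 0u32, 72u32);
    let mut bytes = encoded.bytes().peekable();
    while bytes.peek().is_some() {
        // Decode one variable-length integer and add it to `i`.
        let old_i = i;
        let (mut w, mut k) = (1u32, BASE);
        loop {
            let d = match bytes.next()? {
                b @ b'a'..=b'z' => (b - b'a') as u32,
                b @ b'0'..=b'9' => (b - b'0') as u32 + 26,
                _ => return None,
            };
            i = i.checked_add(d.checked_mul(w)?)?;
            let t = if k <= bias { T_MIN } else if k >= bias + T_MAX { T_MAX } else { k - bias };
            if d < t {
                break;
            }
            w = w.checked_mul(BASE - t)?;
            k += BASE;
        }

        // Bias adaptation, following RFC 3492 section 6.1.
        let count = len as u32 + 1;
        let mut delta = (i - old_i) / if old_i == 0 { DAMP } else { 2 };
        delta += delta / count;
        let mut steps = 0u32;
        while delta > ((BASE - T_MIN) * T_MAX) / 2 {
            delta /= BASE - T_MIN;
            steps += BASE;
        }
        bias = steps + (BASE - T_MIN + 1) * delta / (delta + SKEW);

        // `i` encodes both the code point increment and the insert position.
        n = n.checked_add(i / count)?;
        i %= count;
        if len == MAX_CHARS {
            return None; // give up: bound exceeded
        }
        let c = char::from_u32(n)?;
        // Arbitrary-position insertion: shift the tail right by one char.
        out.copy_within(i as usize..len, i as usize + 1);
        out[i as usize] = c;
        len += 1;
        i += 1;
    }
    Some(len)
}
```

A caller would allocate `['\0'; MAX_CHARS]` on the stack, call `decode_bounded`, and fall back to printing the raw Punycode when it returns `None`. The `copy_within` call is the arbitrary-position insertion mentioned above, and it is the reason a `char` buffer is used rather than a UTF-8 string.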

The RFC encodes constant values via the `<const-data> = {<hex-digit>} "_"`
production, where `{<hex-digit>}` is the numeric value of the constant, not
its representation as bytes. Using the numeric value is platform independent
but does not easily scale to non-integer data types.
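A minimal sketch of the `{<hex-digit>} "_"` production for integer constants, in Rust. The function names are hypothetical, and two details are assumptions of this sketch rather than statements from the RFC: digits are written in lowercase, and an empty digit sequence is rejected when decoding.

```rust
/// Encodes an integer constant as hex digits followed by `_`.
/// Lowercase digits are an assumption of this sketch.
fn encode_const_data(value: u128) -> String {
    format!("{:x}_", value)
}

/// Parses `{<hex-digit>} "_"` back into its numeric value.
/// Returns `None` if the trailing `_` is missing or the digits are invalid.
fn decode_const_data(s: &str) -> Option<u128> {
    let digits = s.strip_suffix('_')?;
    u128::from_str_radix(digits, 16).ok()
}
```

Because the digits spell the number itself rather than its in-memory bytes, the same constant produces the same text regardless of the target's endianness or pointer width.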


@eddyb

eddyb Mar 15, 2019

Member

I think we can use the byte representation for non-scalars, although that would not be platform-independent anymore.

The tricky bit is really pointers, which we could handle like miri does...

But if we want to only allow #[structural_match] types (originally meant for pattern-matching), then we can do something slightly better, which is represent them in the way a pattern would be.

IIRC, miri already has logic for lifting a value into a pattern, producing primitive leaves (integers, bools, chars and strings IIRC) and nested refutable/irrefutable constructors (e.g. tuples, structs, enum variants) - cc @oli-obk.

So we could define an extended const encoding, without losing compatibility with this RFC, which uses non-primitive types in <const> to indicate printing TheType { 0: ..., 1: ..., ... } (not sure we should bother with field names in the mangling), where TheType is a path so we can even use it to refer to a variant, for enums!

Sadly, the <type> inside the <const> is awkwardly placed whichever way we do it, unless we don't really use a <type> but something that can support nesting.

If we want to stay maximally compatible with any scheme for non-primitive-type const generics, maybe we should use <basic-type> instead of <type> in <const>?
My demangler implementation already gives up if it's not an unsigned integer type anyway.

This comment was marked as off-topic.

@oli-obk

oli-obk Mar 15, 2019

Contributor

IIRC, miri already has logic for lifting a value into a pattern, producing primitive leaves (integers, bools, chars and strings IIRC) and nested refutable/irrefutable constructors (e.g. tuples, structs, enum variants)

Actually, the pattern matching code has that logic somewhat (it's limited on purpose). It's very problematic to just arbitrarily expand constants into patterns. E.g. if you have a byte literal read from a file, it could have thousands of bytes. Expanding that into a slice pattern of byte patterns is very memory- and CPU-intensive.

This comment was marked as off-topic.

@eddyb

eddyb Mar 15, 2019

Member

Aren't there byte literal patterns? Anyway, str and [u8] can be serialized as the bytes themselves (i.e. like a literal), no need to treat them like aggregates.

This comment was marked as off-topic.

@oli-obk

oli-obk Mar 15, 2019

Contributor

In order to recognize that the b"foo" pattern is unreachable, we need to do some sort of deaggregation. At least in the way the current exhaustiveness checks are written.

    match b"foo" {
        b"bar" => {},
        [b'f', b'o', x] => println!("{}", x),
        b"foo" => {}, // unreachable pattern
        _ => {}
    }