New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking issue for `asm` (inline assembly) #29722

Open
aturon opened this Issue Nov 9, 2015 · 85 comments

Comments

Projects
None yet
@aturon
Member

aturon commented Nov 9, 2015

This issue tracks stabilization of inline assembly. The current feature has not gone through the RFC process, and will probably need to do so prior to stabilization.

@bstrie

This comment has been minimized.

Contributor

bstrie commented Nov 10, 2015

Will there be any difficulties with ensuring the backward-compatibility of inline assembly in stable code?

@bstrie

This comment has been minimized.

Contributor

bstrie commented Apr 8, 2016

@main-- has a great comment at rust-lang/rfcs#1471 (comment) that I'm reproducing here for posterity:

With all the open bugs and instabilities surrounding asm!() (there's a lot), I really don't think it's ready for stabilization - even though I'd love to have stable inline asm in Rust.

We should also discuss whether today's asm!() really is the best solution or if something along the lines of RFC #129 or even D would be better. One important point to consider here is that asm() does not support the same set of constraints as gcc. Therefore, we can either:

  • Stick to the LLVM behavior and write docs for that (because I've been unable to find any). Nice because it avoids complexity in rustc. Bad because it will confuse programmers coming from C/C++ and because some constraints might be hard to emulate in Rust code.
  • Emulate gcc and just link to their docs: Nice because many programmers already know this and there's plenty of examples one can just copy-paste with little modifications. Bad because it's a nontrivial extension to the compiler.
  • Do something else (like D does): A lot of work that may or may not pay off. If done right, this could be vastly superior to gcc-style in terms of ergonomics while possibly integrating more nicely with language and compiler than just an opaque blob (lots of handwaving here as I'm not familiar enough with compiler internals to assess this).

Finally, another thing to consider is #1201 which in its current design (I think) depends quite heavily on inline asm - or inline asm done right, for that matter.

@briansmith

This comment has been minimized.

briansmith commented Apr 8, 2016

I personally think it would be better to do what Microsoft did in MSVC x64: define a (nearly-)comprehensive set of intrinsic functions, for each asm instruction, and do "inline asm" exclusively through those intrinsics. Otherwise, it's very difficult to optimize the code surrounding inline asm, which is ironic since many uses of inline asm are intended to be performance optimizations.

One advantage of the instrinsic-based approach is that it doesn't need to be an all-or-nothing thing. You can define the most needed intrinsics first, and build the set out incrementally. For example, for crypto, having _addcarry_u64, _addcarry_u32. Note that the work to do the instrinsics seems to have been done quite thoroughly already: https://github.com/huonw/llvmint.

Further, the intrinsics would be a good idea to add even if it were ultimately decided to support inline asm, as they are much more convenient to use (based on my experience using them in C and C++), so starting with the intrinsics and seeing how far we get seems like a zero-risk-of-being-wrong thing.

@cuviper

This comment has been minimized.

Member

cuviper commented Apr 8, 2016

Intrinsics are good, but asm! can be used for more than just inserting instructions.
For example, see the way I'm generating ELF notes in my probe crate.
https://github.com/cuviper/rust-libprobe/blob/master/src/platform/systemtap.rs

I expect that kind of hackery will be rare, but I think it's still a useful thing to support.

@arielb1

This comment has been minimized.

Contributor

arielb1 commented Apr 9, 2016

@briansmith

Inline asm is also useful for code that wants to do its own register/stack allocation (e.g. naked functions).

@Ericson2314

This comment has been minimized.

Contributor

Ericson2314 commented Apr 9, 2016

@briansmith yeah those are some excellent reasons to use intrinsics where possible. But it's nice to have inline assembly as the ultimate excape hatch.

@main--

This comment has been minimized.

main-- commented Apr 9, 2016

@briansmith Note that asm!() is kind of a superset of intrinsics as you can build the latter using the former. (The common argument against this reasoning is that the compiler could theoretically optimize through intrinsics, e.g. hoist them out of loops, run CSE on them, etc. However, it's a pretty strong counterpoint that anyone writing asm for optimization purposes would do a better job at that than the compiler anyways.) See also #29722 (comment) and #29722 (comment) for cases where inline asm works but intrinsics don't.

On the other hand, intrinsics critically depend on a "sufficiently smart compiler" to achieve at least the performance one would get with a hand-rolled asm implementation. My knowledge on this is outdated but unless there has been significant progress, intrinsics-based implementations are still measurably inferior in many - if not most - cases. Of course they're much more convenient to use but I'd say that programmers really don't care much about that when they're willing to descend into the world of specific CPU instructions.

Now another interesting consideration is that intrinsics could be coupled with fallback code on architectures where they're not supported. This gives you the best of both worlds: Your code is still portable - it can just employ some hardware accelerated operations where the hardware supports them. Of course this only really pays off for either very common instructions or if the application has one obvious target architecture. Now the reason why I'm mentioning this is that while one could argue that this may potentially even be undesirable with compiler-provided intrinsics (as you'd probably care about whether you actually get the accelerated versions plus compiler complexity is never good) I'd say that it's a different story if the intrinsics are provided by a library (and only implemented using inline asm). In fact, this is the big picture I'd prefer even though I can see myself using intrinsics more than inline asm.

(I consider the intrinsics from RFC #1199 somewhat orthogonal to this discussion as they exist mostly to make SIMD work.)

@jimblandy

This comment has been minimized.

Contributor

jimblandy commented May 19, 2016

@briansmith

Otherwise, it's very difficult to optimize the code surrounding inline asm, which is ironic since many uses of inline asm are intended to be performance optimizations.

I'm not sure what you mean here. It's true that the compiler can't break down the asm into its individual operations to do strength reduction or peephole optimizations on it. But in the GCC model, at least, the compiler can allocate the registers it uses, copy it when it replicates code paths, delete it if it's never used, and so on. If the asm isn't volatile, GCC has enough information to treat it like any other opaque operation like, say, fsin. The whole motivation for the weird design is to make inline asm something the optimizer can mess with.

But I haven't used it a whole lot, especially not recently. And I have no experience with LLVM's rendition of the feature. So I'm wondering what's changed, or what I've misunderstood all this time.

@alexcrichton

This comment has been minimized.

Member

alexcrichton commented Jul 5, 2017

We discussed this issue at the recent work week as @japaric's survey of the no_std ecosystem has the asm! macro as one of the more commonly used features. Unfortunately we didn't see an easy way forward for stabilizing this feature, but I wanted to jot down the notes we had to ensure we don't forget all this.

  • First, we don't currently have a great specification of the syntax accepted in the asm! macro. Right now it typically ends up being "look at LLVM" which says "look at clang" which says "look at gcc" which doesn't have great docs. In the end this typically bottoms out at "go read someone else's example and adapt it" or "read LLVM's source code". For stabilization a bare minimum is that we need to have a specification of the syntax and documentation.

  • Right now, as far as we know, there's no stability guarantee from LLVM. The asm! macro is a direct binding to what LLVM does right now. Does this mean that we can still freely upgrade LLVM when we'd like? Does LLVM guarantee it'll never ever break this syntax? A way to alleviate this concern would be to have our own layer that compiles to LLVM's syntax. That way we can change LLVM whenever we like and if the implementation of inline assembly in LLVM changes we can just update our translation to LLVM's syntax. If asm! is to become stable we basically need some mechanism of guaranteeing stability in Rust.

  • Right now there are quite a few bugs related to inline assembly. The A-inline-assembly tag is a good starting point, and it's currently littered with ICEs, segfaults in LLVM, etc. Overall this feature, as implemented today, doesn't seem to live up to the quality guarantees others expect from a stable feature in Rust.

  • Stabilizing inline assembly may make an implementation of an alternate backend very difficult. For example backends such as miri or cranelift may take a very long time to reach feature parity with the LLVM backend, depending on the implementation. This may mean that there's a smaller slice of what can be done here, but it's something important to keep in mind when considering stabilizing inline assembly.


Despite the issues listed above we wanted to be sure to at least come away with some ability to move this issue forward! To that end we brainstormed a few strategies of how we can nudge inline assembly towards stabilization. The primary way forward would be to investigate what clang does. Presumably clang and C have effectively stable inline assembly syntax and it may be likely that we can just mirror whatever clang does (especially wrt LLVM). It would be great to understand in greater depth how clang implements inline assembly. Does clang have its own translation layer? Does it validate any input parameters? (etc)

Another possibility for moving forward is to see if there's an assembler we can just take off the shelf from elsewhere that's already stable. Some ideas here were nasm or the plan9 assembler. Using LLVM's assembler has the same problems about stability guarantees as the inline assembly instruction in the IR. (it's a possibility, but we need a stability guarantee before using it)

@Amanieu

This comment has been minimized.

Contributor

Amanieu commented Jul 5, 2017

I would like to point out that LLVM's inline asm syntax is different from the one used by clang/gcc. Differences include:

  • LLVM uses $0 instead of %0.
  • LLVM doesn't support named asm operands %[name].
  • LLVM supports different register constraint types: for example "{eax}" instead of "a" on x86.
  • LLVM support explicit register constraints ("{r11}"). In C you must instead use register asm variables to bind a value to a register (register asm("r11") int x).
  • LLVM "m" and "=m" constraints are basically broken. Clang translates these into indirect memory constraints "*m" and "=*m" and pass the address of the variable to LLVM instead of the variable itself.
  • etc...

Clang will convert inline asm from the gcc format into the LLVM format before passing it on to LLVM. It also performs some validation of the constraints: for example it ensures that "i" operands are compile-time constants,


In light of this I think that we should implement the same translation and validation that clang does and support proper gcc inline asm syntax instead of the weird LLVM one.

@alexcrichton

This comment has been minimized.

Member

alexcrichton commented Jul 14, 2017

There's an excellent video about summaries with D, MSVC, gcc, LLVM, and Rust with slides online

@jcranmer

This comment has been minimized.

jcranmer commented Jul 20, 2017

As someone who'd love to be able to use inline ASM in stable Rust, and with more experience than I want trying to access some of the LLVM MC APIs from Rust, some thoughts:

  • Inline ASM is basically a copy-paste of a snippet of code into the output .s file for assembling, after some string substitution. It also has attachments of input and output registers as well as clobbered registers. This basic framework is unlikely to ever really change in LLVM (although some of the details might vary slightly), and I suspect that this is a fairly framework-independent representation.

  • Constructing a translation from a Rust-facing specification to an LLVM-facing IR format isn't hard. And it might be advisable--the rust {} syntax for formatting doesn't interfere with assembly language, unlike LLVM's $ and GCCs % notation.

  • LLVM does a surprisingly bad job in practice of actually identifying which registers get clobbered, particularly in instructions not generated by LLVM. This means it's pretty much necessary for the user to manually specify which registers get clobbered.

  • Trying to parse the assembly yourself is likely to be a nightmare. The LLVM-C API doesn't expose the MCAsmParser logic, and these classes are quite annoying to get working with bindgen (I've done it).

  • For portability to other backends, as long as you keep the inline assembly mostly on the level of "copy-paste this string with a bit of register allocation and string substitution", it shouldn't inhibit backends all that much. Dropping the integer constant and memory constraints and keeping just register bank constraints shouldn't pose any problems.

@parched

This comment has been minimized.

Contributor

parched commented Sep 6, 2017

I've been having a bit of play to see what can be done with procedural macros. I've written one that converts GCC style inline assembly to rust style https://github.com/parched/gcc-asm-rs. I've also started working on one that uses a DSL where the user doesn't have to understand the constraints and they're all handled automatically.

So I've come to the conclusion that I think rust should just stabilise the bare building blocks, then the community can iterate out of tree with macros to come up with best solutions. Basically, just stabilise the llvm style we have now with only "r" and "i" and maybe "m" constraints, and no clobbers. Other constraints and clobbers can be stabilised later with their own mini rfc type things.

@bstrie

This comment has been minimized.

Contributor

bstrie commented Oct 3, 2017

Personally I'm starting to feel as though stabilizing this feature is the sort of massive task that will never get done unless somehow someone hires a full-time expert contractor to push on this for a whole year. I want to believe that @parched's suggestion of stabilizing asm! piecemeal will make this tractable. I hope someone picks it up and runs with it. But if it isn't, then we need to stop trying to reach for the satisfactory solution that will never arrive and reach for the unsatisfactory solution that will: stabilize asm! as-is, warts, ICEs, bugs and all, with bright bold warnings in the docs advertising the jank and nonportability, and with the intent to deprecate someday if a satisfactory implementation should ever miraculously descend, God-sent, on its heavenly host. IOW, we should do exactly what we did for macro_rules! (and of course, just like for macro_rules!, we can have a brief period of frantic band-aiding and leaky future-proofing). I'm sad at the ramifications for alternative backends, but it's shameful for a systems language to relegate inline assembly to such a limbo, and we can't let the hypothetical possibility of multiple backends continue to obstruct the existence of one actually usable backend. I beg of you, prove me wrong!

@main--

This comment has been minimized.

main-- commented Oct 4, 2017

it's shameful for a systems language to relegate inline assembly to such a limbo

As a data point, I happen to be working on a crate right now that depends on gcc for the sole purpose of emitting some asm with stable Rust: https://github.com/main--/unwind-rs/blob/266e0f26b6423f4a2b8a8c72442b319b5c33b658/src/unwind_helper.c


While it certainly has its advantages, I'm a bit wary of the "stabilize building blocks and leave the rest to proc-macros"-approach. It essentially outsources the design, RFC and implementation process to whoever wants to do the job, potentially no one. Of course having weaker stability/quality guarantees is the entire point (the tradeoff is that having something imperfect is already much better than having nothing at all), I understand that.

At least the building blocks should be well-designed - and in my opinion, "expr" : foo : bar : baz definitely isn't. I can't remember ever getting the order right on the first try, I always have to look it up. "Magic categories separated by colons where you specify constant strings with magic characters that end up doing magic things to the variable names that you also just mash in there somehow" is just bad.

@nbp

This comment has been minimized.

Contributor

nbp commented Oct 4, 2017

One idea, …

Today, there is already a project, named dynasm, which can help you generate assembly code with a plugin used to pre-process the assembly with one flavor of x64 code.

This project does not answer the problem of inline assembly, but it can certainly help, if rustc were to provide a way to map variables to registers, and accept to insert set of bytes in the code, such project could also be used to fill-up these set of bytes.

This way, the only standardization part needed from rustc point of view, is the ability to inject any byte sequence in the generated code, and to enforce specific register allocations. This removes all the choice for specific languages flavors.

Even without dynasm, this can also be used as a way to make macros for the cpuid / rtdsc instructions, which would just be translated into the raw sequence of bytes.

I guess the next question might be if we want to add additional properties/constraints to the byte-sequences.

@jimblandy

This comment has been minimized.

Contributor

jimblandy commented Oct 4, 2017

[EDIT: I don't think anything I said in this comment is correct.]

If we want to continue to use LLVM's integrated assembler (I assume this is faster than spawning an external assembler), then stabilization means stabilizing on exactly what LLVM's inline assembly expressions and integrated assembler support—and compensating for changes to those, should any occur.

If we're willing to spawn an external assembler, then we can use any syntax we want, but we're then foregoing the advantages of the integrated assembler, and exposed to changes in whatever external assembler we're calling.

@cuviper

This comment has been minimized.

Member

cuviper commented Oct 4, 2017

I think it would be strange to stabilize on LLVM's format when even Clang doesn't do that. Presumably it does use LLVM's support internally, but it presents an interface more like GCC.

@bstrie

This comment has been minimized.

Contributor

bstrie commented Oct 5, 2017

I'm 100% fine with saying "Rust supports exactly what Clang supports" and calling it a day, especially since AFAIK Clang's stance is "Clang supports exactly what GCC supports". If we ever have a real Rust spec, we can soften the language to "inline assembly is implementation-defined". Precedence and de-facto standardization are powerful tools. If we can repurpose Clang's own code for translating GCC syntax to LLVM, all the better. The alternative backend concerns don't go away, but theoretically a Rust frontend to GCC wouldn't be much vexed. Less for us to design, less for us to endlessly bikeshed, less for us to teach, less for us to maintain.

@jimblandy

This comment has been minimized.

Contributor

jimblandy commented Oct 5, 2017

If we stabilize something defined in terms of what clang supports, then we should call it clang_asm!. The asm! name should be reserved for something that's been designed through a full RFC process, like other major Rust features. #bikeshed

There are a few things I'd like to see in Rust inline assembly:

  • The template-with-substitutions pattern is ugly. I'm always jumping back and forth between the assembly text and the constraint list. Brevity encourages people to use positional parameters, which makes legibility worse. Symbolic names often mean you have the same name repeated three times: in the template, naming the operand, and in the expression being bound to the operand. The slides mentioned in Alex's comment show that D and MSVC let you simply reference variables in the code, which seems much nicer.

  • Constraints are both hard to understand, and (mostly) redundant with the assembly code. If Rust had an integrated assembler with a sufficiently detailed model of the instructions, it could infer the constraints on the operands, removing a source of error and confusion. If the programmer needs a specific encoding of the instruction, then they would need to supply an explicit constraint, but this would usually not be necessary.

Norman Ramsey and Mary Fernández wrote some papers about the New Jersey Machine Code Toolkit way back when that have excellent ideas for describing assembly/machine language pairs in a compact way. They tackle (Pentium Pro-era) iA-32 instruction encodings; it is not at all limited to neat RISC ISAs.

@alexcrichton

This comment has been minimized.

Member

alexcrichton commented Oct 5, 2017

I'd like to reiterate again the conclusions from the most recent work week:

  • Today, as far as we know, there's basically no documentation for this feature. This includes LLVM internals and all.
  • We have, as far as we know, no guarantee of stability from LLVM. For all we know the implementation of inline assembly in LLVM could change any day.
  • This is, currently, a very buggy feature in rustc. It's chock full of (at compile time) segfaults, ICEs, and weird LLVM errors.
  • Without a specification it's nigh impossible to even imagine an alternate backend for this.

To me this is the definition of "if we stabilize this now we will guarantee to regret it in the future", and not only "regret it" but seems very likely for "causes serious problems to implement any new system".

At the absolute bare minimum I'd firmly believe that bullet (2) cannot be compromised on (aka the definition of stable in "stable channel"). The other bullets would be quite sad into forgo as it erodes the expected quality of the Rust compiler which is currently quite high.

@jimblandy

This comment has been minimized.

Contributor

jimblandy commented Oct 5, 2017

@jcranmer wrote:

LLVM does a surprisingly bad job in practice of actually identifying which registers get clobbered, particularly in instructions not generated by LLVM. This means it's pretty much necessary for the user to manually specify which registers get clobbered.

I would think that, in practice, it would be quite difficult to infer clobber lists. Just because a machine-language fragment uses a register doesn't mean it clobbers it; perhaps it saves it and restores it. Conservative approaches could discourage the code generator from using registers that would be fine to use.

@gnzlbg

This comment has been minimized.

Contributor

gnzlbg commented Aug 21, 2018

we'd love to see some movement on stabilizing inline and module-level assembler.

The last pre-RFC (https://internals.rust-lang.org/t/pre-rfc-inline-assembly/6443) achieved consensus 6 months ago (at least on most of the fundamental issues), so the next step is to submit an RFC that builds on that. If you want this to happen faster I'd recommend contacting @Florob about it.

@MSxDOS

This comment has been minimized.

MSxDOS commented Aug 21, 2018

For what it's worth, I need direct access to FS\GS registers to get the pointer to the TEB struct on Windows, I also need a _bittest64-like intrinsic to apply bt to an arbitrary memory location, neither of which I could find a way to do without inline assembly or extern calls.

The third point mentioned here concerns me, though, as LLVM indeed prefers to Just Crash if something is wrong providing no error messaging what so ever.

@gnzlbg

This comment has been minimized.

Contributor

gnzlbg commented Aug 21, 2018

@MSxDOS

I also need a _bittest64-like intrinsic to apply bt to an arbitrary memory location, neither of which I could find a way to do without inline assembly or extern calls.

It shouldn't be hard to add that one to stdsimd, clang implements these using inline assembly (https://github.com/llvm-mirror/clang/blob/c1c07cca8cae5f924cedaac7b202b0f3c167111d/test/CodeGen/bittest-intrin.c#L45) but we can use that in the std library and expose the intrinsic to safe Rust.

Feel encouraged to open an issue in the stdsimd repo about the missing intrinsics.

@eddyb

This comment has been minimized.

Member

eddyb commented Aug 22, 2018

@josevalaad

Well, when you are talking about assembly in math, you are basically talking about using the SIMD registers and instructions like _mm256_mul_pd, _mm256_permute2f128_pd, etc. and vectorization operations where it proceed.

Ah, I suspected that might be the case. Well, if you want to give it a try, you could translate the assembly into std::arch intrinsic calls and see if you get the same performance out of it.

If you don't, please file issues. LLVM isn't magic, but at least intrinsics should be as good as asm.

@eddyb

This comment has been minimized.

Member

eddyb commented Aug 22, 2018

@dancrossnyc If you don't mind me asking, are there any usecases/platform features in particular that require inline assembly, in your situation?

@MSxDOS Maybe we should expose intrinsics for reading the "segment" registers?


Maybe we should do some data collection and get a breakdown of what people really want asm! for, and see how many of those could be supported in some other way.

@gnzlbg

This comment has been minimized.

Contributor

gnzlbg commented Aug 22, 2018

Maybe we should do some data collection and get a breakdown of what people really want asm!

I want asm! for:

  • working around intrinsics not provided by the compiler
  • working around compiler bugs / sub-optimal code generation
  • performing operations that cannot be performed via a sequence of single intrinsics calls, e.g., a read EFLAGS-modify-write EFLAGS where LLVM is allowed to modify eflags between the read and the write, and where LLVM also assumes that the user won't modify this behind its back (that is, the only way to safely work with EFLAGS is to write the read-modify-write operations as a single atomic asm! block).

and see how many of those could be supported in some other way.

I don't see any other way of supporting any of those use cases that doesn't involve some form of inline assembly but my mind is open.

@Amanieu

This comment has been minimized.

Contributor

Amanieu commented Aug 22, 2018

Copied from my post in the pre-RFC thread, here is some inline assembly (ARM64) which I am using in my current project:

// Common code for interruptible syscalls
macro_rules! asm_interruptible_syscall {
    () => {
        r#"
            # If a signal interrupts us between 0 and 1, the signal handler
            # will rewind the PC back to 0 so that the interrupt flag check is
            # atomic.
            0:
                ldrb ${0:w}, $2
                cbnz ${0:w}, 2f
            1:
               svc #0
            2:

            # Record the range of instructions which should be atomic.
            .section interrupt_restart_list, "aw"
            .quad 0b
            .quad 1b
            .previous
        "#
    };
}

// There are other versions of this function with different numbers of
// arguments, however they all share the same asm code above.
#[inline]
pub unsafe fn interruptible_syscall3(
    interrupt_flag: &AtomicBool,
    nr: usize,
    arg0: usize,
    arg1: usize,
    arg2: usize,
) -> Interruptible<usize> {
    let result;
    let interrupted: u64;
    asm!(
        asm_interruptible_syscall!()
        : "=&r" (interrupted)
          "={x0}" (result)
        : "*m" (interrupt_flag)
          "{x8}" (nr as u64)
          "{x0}" (arg0 as u64)
          "{x1}" (arg1 as u64)
          "{x2}" (arg2 as u64)
        : "x8", "memory"
        : "volatile"
    );
    if interrupted == 0 {
        Ok(result)
    } else {
        Err(Interrupted)
    }
}
@shepmaster

This comment has been minimized.

Member

shepmaster commented Aug 22, 2018

@Amanieu note that @japaric is working towards the intrinsics for ARM. Would be worth checking to see if that proposal covers your needs.

@gnzlbg

This comment has been minimized.

Contributor

gnzlbg commented Aug 22, 2018

@shepmaster

@Amanieu note that @japaric is working towards the intrinsics for ARM. Would be worth checking to see if that proposal covers your needs.

It is worth remarking that:

  • this work doesn't replace inline assembly, it merely complements it. This approach implements vendor APIs in std::arch, these APIs are insufficient for some people already.

  • this approach is only usable when a sequence of intrinsic calls like foo(); bar(); baz(); produces code indistinguishable from that sequence of instructions - this isn't necessarily the case, and when it isn't, code that looks correct produces at best incorrect results, and at worst has undefined behavior (we had bugs due to this in x86 and x86_64 in std already, e.g., https://github.com/rust-lang-nursery/stdsimd/blob/master/coresimd/x86/cpuid.rs#L108 - other architectures have these issues as well).

  • some intrinsics have immediate mode arguments, which you cannot pass via a function call, so that foo(3) won't work. Every solution to this problem is currently a whacky workaround, and in some cases, no workarounds are currently possible in Rust, so we just don't provide some of these intrinsics.

So if the vendor APIs are implementable in Rust, available on std::arch, and can be combined to solve a problem, I agree that they are better than inline assembly. But every now and then either the APIs are not available, maybe not even implementable, and / or they cannot be combined correctly. While we could fix the "implementability issues" in the future, if what you want to do is not exposed by the vendor API, or the APIs cannot be combined, this approach won't help you.

@main--

This comment has been minimized.

main-- commented Aug 22, 2018

What can be very surprising about LLVM's implementation of intrinsics (SIMD especially) is that they do not conform to Intel's explicit mapping of intrinsics to instructions at all - they are subject to a wide range of compiler optimizations. For instance I remember one time where I attempted to reduce memory pressure by calculating some constants from other constants instead of loading them from memory. But LLVM simply proceeded to constant-fold the entire thing back into the exact memory load I was trying to avoid. In a different case I wanted to investigate replacing a 16-bit shuffle with an 8-bit shuffle to reduce port5 pressure. Yet in its unending wisdom the ever-helpful LLVM optimizer noticed that my 8-bit shuffle is in fact a 16-bit shuffle and replaced it.

Both optimizations certainly yield better throughput (especially in the face of hyperthreading) but not the latency reduction I was hoping to achieve. I ended up dropping down all the way to nasm for that experiment but having to rewrite the code from intrinsics to plain asm was just unnecessary friction. Of course I want the optimizer to handle things like instruction selection or constant folding when using some high-level vector API. But when I explicitly decided which instructions to use I really don't want the compiler to mess around with that. The only alternative is inline asm.

@shepmaster

This comment has been minimized.

Member

shepmaster commented Aug 22, 2018

So if the vendor APIs are implementable in Rust, available on std::arch, and can be combined to solve a problem, I agree that they are better than inline assembly

That's all I've been saying at first

accomplish 95-99% of the goals

and again

Yes, there are cases that require assembly for now and there are cases that will forever need it, I said as much originally (added emphasis for clarity):

It's most people's expectation that [intrinsics] are going to accomplish 95-99% of the goals.

This is the same thing that @eddyb is saying in parallel. I'm unclear why multiple people are acting like I'm completely disregarding the usefulness of inline assembly while trying to point out the realities of the current situation.

I've

  1. Pointed one poster who made no mention of knowing that intrinsics existed towards stable today intrinsics.
  2. Pointed another poster at proposed intrinsics so they can provide early feedback to the proposal.

Let me state this very clearly: yes, inline assembly is sometimes required and good. I am not arguing that. I am only trying to help people solve real world problems with the tools that are available now.

@eddyb

This comment has been minimized.

Member

eddyb commented Aug 22, 2018

What I was trying to say was that we should have a more organized approach to this, a proper survey, and gather up a lot more data than the few of us in this thread, and then use that to point out the most common needs from inline assembly (since it's clear that intrinsics can't fully replace it).

I suspect that each architecture has a tricky-to-model subset, that gets some use from inline asm!, and maybe we should focus on those subsets, and then try to generalize.

cc @rust-lang/lang

@dancrossnyc

This comment has been minimized.

dancrossnyc commented Aug 22, 2018

@eddyb require is a strong word, and I would be compelled to say that no we're not strictly required to use inline assembler. As I mentioned earlier, we could define procedures in pure assembly language, assemble them separately, and link them into our Rust programs via the FFI.

However, as I said earlier I know of no serious OS-level project that does that. It would mean lots of boiler plate (read: more chances to make a mistake), a more complex build process (right now we're fortunate enough that we can get away with a simple cargo invocation and a linked and nearly-ready-to-run kernel pops out of the other end; we'd have to invoke the assembler and link in a separate step), and a drastic decrease in the ability to inline things, etc; there would almost certainly be a performance hit.

Things like compiler intrinsics help in a lot of cases, but for things like the supervisory instruction set of the target ISA, particularly more esoteric hardware features (hypervisor and enclave features, for example), there often aren't intrinsics and we're in a no_std environment. What intrinsics are there often aren't sufficient; e.g., the x86-interrupt calling convention seems cool but doesn't give you mutable access to the general purpose registers in a trap frame: suppose I take an undefined instruction exception with the intent to do emulation, and suppose the emulated instruction returns a value in %rax or something; the calling convention doesn't give me a good way to pass that back to the call-site, so we had to roll our own. That meant writing my own exception handling code in assembler.

So to be honest no, we don't require inline assembler, but it is sufficiently useful that it would almost be a non-starter not to have it.

@eddyb

This comment has been minimized.

Member

eddyb commented Aug 22, 2018

@dancrossnyc I am specifically curious about avoiding separate assembling, that is, what kind of assembly you need at all in your project, no matter how you link it in.

In your case it seems to be a supervisor/hypervisor/enclave privileged ISA subset, is that correct?

there often aren't intrinsics

Is this by necessity, i.e. do the instructions have requirements which are unreasonably difficult or even impossible to uphold when compiled as intrinsic calls through, e.g. LLVM?
Or is this just because they're assumed to be too special-cased to be useful to most developers?

and we're in a no_std environment

For the record, vendor intrinsics are in both std::arch and core::arch (the former is a reexport).

the x86-interrupt calling convention seems cool but doesn't give you mutable access to the general purpose registers in a trap frame

cc @rkruppe Can this be implemented in LLVM?

@dancrossnyc

This comment has been minimized.

dancrossnyc commented Aug 22, 2018

@eddyb correct; we need the supervisor subset of the ISA. I'm afraid I can't say much more at the moment about our specific use case.

Is this by necessity, i.e. do the instructions have requirements which are unreasonably difficult or even impossible to uphold when compiled as intrinsic calls through, e.g. LLVM?
Or is this just because they're assumed to be too special-cased to be useful to most developers?

To some extent both are true, but on balance i would say the latter is more relevant here. Some things are microarchitecture specific and dependent on specific processor package configurations. Would it be reasonable for a compiler to (for example) expose something as an intrinsic that's part of the privileged instruction subset and conditioned on a specific processor version? I honestly don't know.

For the record, vendor intrinsics are in both std::arch and core::arch (the former is a reexport).

That's actually really good to know. Thanks!

@gnzlbg

This comment has been minimized.

Contributor

gnzlbg commented Aug 22, 2018

Would it be reasonable for a compiler to (for example) expose something as an intrinsic that's part of the privileged instruction subset and conditioned on a specific processor version? I honestly don't know.

We already do. For example, the xsave x86 instructions are implemented and exposed in core::arch, not available on all processors, and most of them require privileged mode.

@dancrossnyc

This comment has been minimized.

dancrossnyc commented Aug 22, 2018

@gnzlbg xsave isn't privileged; did you mean xsaves?

I took a look through https://rust-lang-nursery.github.io/stdsimd/x86_64/stdsimd/arch/x86_64/index.html and the only privileged instructions I saw in my quick sweep (I didn't do an exhaustive search) were xsaves, xsaves64, xrstors, and xrstors64. I suspect those are intrinsics because they fall into the general XSAVE* family and don't generate exceptions in real mode, and some folks want to use clang/llvm to compile real-mode code.

@gnzlbg

This comment has been minimized.

Contributor

gnzlbg commented Aug 22, 2018

@dancrossnyc yes some of those are the ones I meant (we implement xsave, xsaves, xsaveopt, ... in the xsave module: https://github.com/rust-lang-nursery/stdsimd/blob/master/coresimd/x86/xsave.rs).

These are available in core, so you can use them to write an OS kernel for x86. In user-space they are useless AFAICT (they'll always raise an exception), but we don't have a way to distinguish about this in core. We could only expose them in core and not in std though, but since they are already stable, that ship has sailed. Who knows, maybe some OS runs everything in ring 0 someday, and you can use them there...

@dancrossnyc

This comment has been minimized.

dancrossnyc commented Aug 22, 2018

@gnzlbg I don't know why xsaveopt or xsave would raise an exception in userspace: xsaves is the only one of the family that's defined to generate an exception (#GP if CPL>0), and then only in protected mode (SDM vol.1 ch. 13; vol.2C ch. 5 XSAVES). xsave and xsaveopt are useful for implementing e.g. pre-emptive user-space threads, so their presence as intrinsics actually makes sense. I suspect the intrinsic for xsaves was either because someone just added everything from the xsave family without realizing the privilege issue (that is, assuming it was invocable from userspace), or someone wanted to call it from real mode. That latter may seem far-fetched but I know people are e.g. building real-mode firmware with Clang and LLVM.

Don't get me wrong; the presence of LLVM intrinsics in core is great; if I never have to write that silly sequence of instructions to get the results of rdtscp into a useful format again, I'll be happy. But the current set of intrinsics are not a substitute for inline assembler when you're writing a kernel or other bare-metal supervisory sort of thing.

@gnzlbg

This comment has been minimized.

Contributor

gnzlbg commented Aug 22, 2018

@dancrossnyc when I mentioned xsave I was referring to some of the intrinsics that are available behind the CPUID bits XSAVE, XSAVEOPT, XSAVEC, etc. Some of these intrinsics require privileged mode.

Would it be reasonable for a compiler to (for example) expose something as an intrinsic that's part of the privileged instruction subset and conditioned on a specific processor version?

We already do and they are available in stable Rust.

I suspect the intrinsic for xsaves was either because someone just added everything from the xsave family without realizing the privilege issue

I added these intrinsics. We realized the privilege issues and decided to add them anyways because it is perfectly fine for a program depending on coreto be an OS kernel that wants to use these, and they are harmless in userspace (as in, if you try to use them, your process terminates).

But the current set of intrinsics are not a substitute for inline assembler when you're writing a kernel or other bare-metal supervisory sort of thing.

Agreed, that's why this issue is still open ;)

@dancrossnyc

This comment has been minimized.

dancrossnyc commented Aug 22, 2018

@gnzlbg sorry, I don't mean to derail this by rabbit-holing on xsave et al.

However, as near as I can tell, the only intrinsics that require privileged execution are those related to xsaves and even then it's not always privileged (again, real-mode doesn't care). It's wonderful that those are available in stable Rust (seriously). The others might be useful in userspace and similarly I think it's great that they're there. However, xsaves and xrstors are a very, very small portion of the privileged instruction set and having added intrinsics for two instructions is qualitatively different than doing so generally and I think the question remains as to whether it's appropriate in general. Consider the VMWRITE instruction from the VMX extensions, for example; I imagine an intrinsic would do something like execute instruction and then "return" rflags. That's sort of an oddly specialized thing to have as an intrinsic.

I think otherwise we're in agreement here.

@gnzlbg

This comment has been minimized.

Contributor

gnzlbg commented Aug 22, 2018

FWIW per the std::arch RFC we can currently only add intrinsics to std::arch that the vendors expose in their APIs. For the case of xsave, Intel exposes them on its C API, so that's why it is ok that's there. If you need any vendor intrinsics that are not currently exposed, open an issue, whether it requires privileged mode or not is irrelevant.

If the vendor doesn't expose an intrinsic for it, then std::arch might not be the place for it, but there are many alternatives to that (inline assembly, global asm, calling C, ...).

@dancrossnyc

This comment has been minimized.

dancrossnyc commented Aug 22, 2018

Sorry, I understood you saying you wrote the intrinsics for xsave to mean the Intel intrinsics; my earlier comments still apply as to why I think xsaves is an intrinsic then (either an accident by a compiler writer at Intel or because someone wanted it for real mode; I feel like the former would be noticed really quickly but firmware does weird stuff, so the latter wouldn't surprise me at all).

Anyway, yes, I think we fundamentally agree: intrinsics aren't the place for everything, and that's why we'd like to see asm!() moved to stable. I'm really excited to hear that progress is being made in this area, as you said yesterday, and if we can gently nudge @Florob to bubble this up closer to the top of the stack, we'd be happy to do so!

@joshtriplett

This comment has been minimized.

Member

joshtriplett commented Aug 22, 2018

A few additional details and use cases for asm!:

When you're writing an operating system, firmware, certain types of libraries, or certain other types of system code, you need full access to platform-level assembly. Even if we had intrinsics that exposed every single instruction in every architecture Rust supports (which we don't come anywhere close to having), that still wouldn't be enough for some of the stunts that people regularly pull with inline assembly.

Here are a small fraction of things you can do with inline assembly that you can't easily do in other ways. Every single one of these is a real-world example I've seen (or in some cases written), not a hypothetical.

  • Collect all the implementations of a particular pattern of instructions in a separate ELF section, and then in loading code, patch that section at runtime based on characteristics of the system you run on.
  • Write a jump instruction whose target gets patched at runtime.
  • Emit an exact sequence of instructions (so you can't count on intrinsics for the individual instructions), so that you can implement a pattern that carefully handles potential interruptions in the middle.
  • Emit an instruction, followed by a jump to the end of the asm block, followed by fault recovery code for a hardware fault handler to jump to if the instruction generates a fault.
  • Emit a sequence of bytes corresponding to an instruction the assembler doesn't know about yet.
  • Write a piece of code that carefully switches to a different stack and then calls another function.
  • Call assembly routines or system calls that require arguments in specific registers.
@dancrossnyc

This comment has been minimized.

dancrossnyc commented Aug 22, 2018

+1e6

@josevalaad

This comment has been minimized.

josevalaad commented Aug 23, 2018

@eddyb

Ok, I will try the intrinsics approach and see where it takes. You are probably right and that's the best approach for my case. Thank you!

@mark-i-m

This comment has been minimized.

Contributor

mark-i-m commented Aug 27, 2018

@joshtriplett nailed it! These are the exact use cases I had in mind.

loop {
   :thumbs_up:
}

I would add a couple of other use cases:

  • writing code in weird architectural modes, like BIOS/EFI calls and 16-bit real-mode.
  • writing code with strange/unusual addressing modes (which comes up often in 16-bit real-mode, bootloaders, etc.)
@joshtriplett

This comment has been minimized.

Member

joshtriplett commented Aug 27, 2018

@mark-i-m Absolutely! And generalizing a point that has sub-cases in both of our lists: translating between calling conventions.

@nbp nbp referenced this issue Sep 11, 2018

Open

Inline Assembly. #444

@mqudsi

This comment has been minimized.

mqudsi commented Dec 14, 2018

I am closing out #53118 in favor of this issue and copying the PR here for the record. Note that this is from August, but a brief look seems to indicate the situation hasn't changed:


The section on inline assembly needs an overhaul; in its present state it implies that the behavior and syntax is tied to rustc and the rust language in general. Pretty much the entire documentation is specific to x86/x86_64 assembly with the llvm toolchain. To be clear, I am not referring to the assembly code itself, which is obviously platform-specific, but rather the general architecture and usage of inline assembly altogether.

I didn't find an authoritative source for the behavior of inline assembly when it comes to ARM target, but per my experimentation and referencing the ARM GCC inline assembly documentation, the following points seem to be completely off:

  • The ASM syntax, as ARM/MIPS (and most other CISC?) use intel-esque syntax with the destination register first. I understood the documentation to mean/imply that inline asm took at&t syntax which was transpiled to actual platform/compiler-specific syntax, and that I should just substitute the names of the x86 registers with that of the ARM registers only.
  • Relatedly, the intel option is invalid, as is it causes "unknown directive" errors when compiling.
  • Adapting from the ARM GCC inline assembly documentation (for building against thumbv7em-none-eabi with the arm-none-eabi-* toolchain, it appears that even some basic assumptions about the format of inline assembly are platform-specific. In particular, it seems that for ARM the output register (second macro argument) counts as a register reference, i.e. $0 refers to the first output register and not the first input register, as is the case with the x86 llvm instructions.
  • At the same time, other compiler-specific features are not present; I can't use named references to registers, only indexes (e.g. asm("mov %[result],%[value],ror #1":[result] "=r" (y):[value] "r" (x)); is invalid).
  • (Even for x86/x86_64 targets, the usage of $0 and $2 in the inline assembly example is very confusing, as it does not explain why those numbers were chosen.)

I think what threw me the most is the closing statement:

The current implementation of the asm! macro is a direct binding to LLVM's inline assembler expressions, so be sure to check out their documentation as well for more information about clobbers, constraints, etc.

Which does not seem to be universally true.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment