Enhancement: New Inline Assembly #5241

ghost · 2020-05-01T15:51:47Z

Copied over from #215. Inspiration is via them. Thanks also to @MasterQ32 and @kubkon for help extending it to support stack machine architectures. See #7561 for standalone assembler improvements.

New Inline Assembly

asm volatile? {bindings}? body? : post_expression?

TL;DR: Benefits over Status Quo

No mandatory sections -- flexible to any application
Components are listed in evaluation order
First-class support for stack machine architectures
First-class support for floating point, vector, and as yet unforeseen register types
Operands have types
Named inputs are optional
Input/output characteristics fully customisable
Not bound to an input/output model
Can access program symbols and call functions safely
Volatility is inferred in most cases
Concise, flexible wildcard syntax
Substitution syntax easier to scan and less likely to clash with native symbols
Open to architecture-specific extensions
Communicates stack-relevant metadata to compiler
Can be automatically distinguished from status quo; no sneaky breakages

Stack Machines

This syntax has first-class support for stack machine architectures such as WebAssembly, the JVM, and @MasterQ32's SPU Mk. II. It accomplishes this with a novel batch-push and batch-pop mechanism for marshaling between Zig and the stack. Because there is a significant difference between register and stack machine architectures, a new .paradigm() method is defined on builtin.Arch, which returns an enum with the variants .register and .stack. (NOTE: supporting stack machines with LLVM is a very hard problem -- maybe defer to stage 2?)

Volatile

This block has side effects, and may not be optimised away if its value is not used. Implied by a return type of void or noreturn, or a mutable symbol binding -- so, in practice, very rarely used.

Bindings

There are three types of bindings: operand, symbol, and clobber. All of them use specially formatted comptime strings to interface with assembly, as in status quo. This decision was made as integrating the required functionality into Zig itself would have required either breaking several guidelines or introducing special constructs with no other use cases.

Operand

An operand binding has the form "operand" name: type = value. Within the block, ?(name) then refers to an operand compatible with Zig type type, initially with value value, which may be a register (integer, float, or vector), a datum literal (only integer in every ISA I'm aware of), a stack top (array with size a multiple of stack alignment), or a processor condition code (boolean). type must be coercible to all of name's uses in the block, taking into account sign- or zero-extension and lane width/count if applicable, and may be omitted if the type of value is known. In addition, value may be omitted if initialisation is not needed, and name may be omitted if only initialisation is needed. The type of the binding must be derivable: at least one of type or value must be present (this also means that operand and symbol bindings are syntactically distinct). Stack pushes and pops must be declared separately -- see below. Condition codes may not be initialised (type must be present and must be bool). operand may be a wildcard, as described below.

Symbol

A symbol binding has the form "type" const? symbol, where symbol is a program symbol in scope. type is a wildcard indicating the type of symbol, which could be a variable or a function. Within the block, ?(symbol) then refers to the assembly program entity corresponding to the Zig program construct (which need not be an exported symbol -- it may be an internal label, a simple address, or even the referenced data itself on stack machines). A const annotation indicates an immutable binding -- this may be safety-checked by comparing the value at the associated address before and after the block. (NOTE: In some assemblies, many label operations are actually macros, which expand to multiple instructions and relocations -- we'd need some way of propagating this information through the compilation pipeline from codegen to linking.)
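As a rough illustration of the const safety check mentioned above, here is a minimal Python model. It assumes memory can be treated as a plain dict from address to value; run_block_checked, memory, and const_addresses are hypothetical names for this sketch, not anything from the compiler.

```python
def run_block_checked(memory, const_addresses, block):
    """Run `block` and verify that const-bound addresses keep their values.

    Hypothetical model: `memory` is a dict of address -> value, and `block`
    is a callable standing in for the asm body.
    """
    # Snapshot every const-bound location before the block runs.
    snapshot = {addr: memory[addr] for addr in const_addresses}
    block(memory)
    # Compare after the block; any difference violates the const binding.
    for addr, old in snapshot.items():
        if memory[addr] != old:
            raise RuntimeError(f"const binding at {addr:#x} was modified")
```

A real check would presumably only run in safe build modes, since it costs a load and a compare per const binding.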

Clobber

A clobber is simply "location", which may be a literal or a wildcard.

Wildcards

Wildcards indicate that a binding has special properties, and give the compiler freedom to fill in some details. Wildcards start with ? and run the length of the binding string. A literal ? is escaped with another one, for symmetry with in-block syntax. Wildcards may be followed by architecture-dependent :options to place restrictions on their resolution -- for instance, ?reg:abcd for a legacy x86 register on x86_64, or ?int:lo12 for a 12-bit integer immediate on RISC-V. Options may change the type of a binding -- for instance, "?tmp:all" callconv(.fast) is a clobber that binds all caller-saved registers under the fast calling convention.
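To make the binding-string rules concrete, here is a small Python sketch of how such a string might be split into a wildcard name and its :options. The parse_binding helper is hypothetical, and treating a doubled ?? as an escaped literal ? anywhere in the string is my reading of the paragraph above rather than a settled rule.

```python
def parse_binding(text):
    """Split a binding string into (kind, name, options).

    kind is "wildcard" or "literal". Hypothetical helper for illustration.
    """
    # A single leading '?' marks a wildcard; '??' is an escaped literal '?'.
    if text.startswith("?") and not text.startswith("??"):
        name, *options = text[1:].split(":")
        return ("wildcard", name, options)
    # Anything else is a literal location; undo the '??' escape.
    return ("literal", text.replace("??", "?"), [])
```

For example, parse_binding("?reg:abcd") yields a wildcard named reg with the single option abcd, while parse_binding("rax") is a plain literal.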

The following wildcards are defined:

Operand

?reg
Arbitrary register. Register machine architectures only. value may be an integer, a float, or an int/float vector, of any architecturally-supported width and length.
?tmp
Arbitrary caller-saved register under current calling convention. See above. May be annotated with callconv to specify a different calling convention.
?sav
Arbitrary callee-saved register under current calling convention. See above.
?lit
Literal. value must be comptime-known, and may be any architecturally-supported literal type.
?psh
Array. value must be provided. Length * element size must be a multiple of platform stack alignment; elements must be size-compatible with stack cells if applicable. Pushed onto the stack at block entry, leftmost element topmost. Only one allowed per block. This is the only way of marshaling non-symbol values into assembly on stack machines.
?pop
Uninitialised array (value must not be provided). See above. Popped from the stack on block exit, topmost element leftmost. This is the only way of marshaling non-symbol values out of assembly on stack machines.
?stg
Additional stack growth, i.e. growth not already accounted for by ?psh or function calls, in bytes. name, type omitted. value must be comptime-known. (NOTE: This does not imply that the stack pointer has a different value before and after the block -- in fact, unless it is listed as a clobber, this is not allowed.)
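The ordering rules for ?psh and ?pop are easy to get backwards, so here is a minimal Python model of them. It assumes the stack is a list whose last element is the top; push_block, pop_block, cell_size, and alignment are hypothetical names for this sketch.

```python
def push_block(stack, values, cell_size, alignment):
    """Model ?psh: push `values` so that values[0] ends up topmost."""
    # Length * element size must be a multiple of the stack alignment.
    if (len(values) * cell_size) % alignment != 0:
        raise ValueError("pushed array is not a multiple of stack alignment")
    # The end of the list is the stack top, so push in reverse order.
    stack.extend(reversed(values))


def pop_block(stack, count):
    """Model ?pop: the topmost element becomes result[0]."""
    if count > len(stack):
        raise ValueError("popping more cells than the stack holds")
    return [stack.pop() for _ in range(count)]
```

With this model, pushing [1, 2, 3, 4] and then popping two cells gives back [1, 2], which matches "leftmost element topmost" on entry and "topmost element leftmost" on exit.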

Symbol

?locl
Local variable. Stack machine only.
?argm
Argument of current function. Stack machine only. Implies const.
?glob
Global variable.
?thdl
Thread-local variable.
?comp
Comptime-known variable/constant. Substitution semantics of a literal. Implies const.
?func
Function. Registers symbol in this block's call graph. Implies const.

Clobber

?memory
Unspecified memory.
?status
Processor status flags.

Body

The assembly code itself, as a comptime string. For symbol scoping purposes, treated as a separate file, i.e. declared symbols do not leak to the rest of the program and elsewhere-defined symbols are not visible except through bindings. May be omitted if only values of registers are desired.

Bound operands and symbols are accessed within the block by enclosing their names in ?(). This syntax was chosen as the ? character is far less commonly used in assembly languages than %, and pairs well with the theme of an unknown resolution -- additionally, parentheses are less likely to have semantic significance than square brackets, so the code is easier to scan. Accessing an unbound name in this manner is a compile error. As with wildcards, names may be modified with :options, for instance ?(r:hi) to access the high byte of register r, or ?(i:x) to print integer i in hexadecimal. A literal ? is escaped with another one, as regular escaping is not possible in multiline strings.
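A rough Python sketch of the substitution step may help here. It assumes each bound name has already been resolved to text for each option it supports; substitute and the shape of the bindings table are hypothetical, and the regular expression is only one possible reading of the ?(name:option) syntax.

```python
import re

# Matches either an escaped '??' or a '?(name)' / '?(name:opt)' reference.
_TOKEN = re.compile(r"\?\?|\?\(([A-Za-z_]\w*)((?::\w+)*)\)")


def substitute(body, bindings):
    """Expand ?(name) references in an asm body.

    `bindings` maps a name to a dict from option string ("" for none)
    to the text that should appear in the final assembly.
    """
    def replace(match):
        if match.group(0) == "??":
            return "?"  # escaped literal question mark
        name = match.group(1)
        option = match.group(2)[1:]  # drop the leading ':' if present
        if name not in bindings:
            # Mirrors the rule that an unbound name is a compile error.
            raise KeyError(f"unbound name in asm body: {name}")
        return bindings[name][option]

    return _TOKEN.sub(replace, body)
```

For instance, with stack_ptr resolved to rax, the body "mov ?(stack_ptr), rsp" expands to "mov rax, rsp".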

Post Expression

An expression evaluated after the body, using the final values of all bindings. Becomes the value of the whole block. Preceded by a colon. May be omitted without ambiguity, in which case the return type is void. This permits us to return as many values as we like, in whatever format and location we choose. Moreover, we don't have to specify the exact lifetimes of all of our inputs and outputs to appease the optimiser -- we can decide for ourselves how our values are allocated and consumed.

Examples

Simple, bindless assembly is simple:

comptime assert(builtin.arch == .x86_64);

// No unused names, types on everything
asm { "rax": u64 = 60, "rdi": u64 = 0 } "syscall";

// No unnecessary detail
starting_stack_ptr = asm { "rsp" sp: usize } : sp;

More involved assembly is logical:

// Using #1717 syntax because that proposal has been accepted
// -- this proposal does not depend on #1717
const vendorId = fn () [3]u32 {
    comptime assert(builtin.arch == .x86_64);

    // Multiple return values, anyone?
    return asm {
        "eax": u32 = 0,
        "ebx" b: u32,
        "ecx" c: u32,
        "edx" d: u32,
        "?memory",
    } "cpuid"
    : .{ b, c, d };
};

// In case we have trouble getting RLS working, we can do it directly
const vendorId2 = fn (result: *[3]u32) void {
    comptime assert(builtin.arch == .x86_64);

    // void return type implies volatile
    asm {
        "eax": u32 = 0,
        "ebx" b: u32,
        "ecx" c: u32,
        "edx" d: u32,
        "?memory",
    } "cpuid"
    : {
        result[0] = b;
        result[1] = c;
        result[2] = d;
    }
};

A simple bare-metal OS entry point on RISC-V:

const stack_height = 16 * 1024;
var stack: [stack_height]usize = undefined;

const _start = fn callconv(.naked) () noreturn {
    comptime assert(builtin.arch == .riscv64);

    asm {
        "?func" kmain,
        "?glob" stack,

        "?reg" stack_size: usize = stack_height,
        "?int" slot_shift: usize = @ctz(@sizeOf(usize)),
        "sp", "ra", "t1",
    }
    \\ slli ?(stack_size), ?(stack_size), ?(slot_shift)
    \\ la sp, ?(stack)
    \\ add sp, sp, ?(stack_size)
    \\ call ?(kmain)
    : unreachable;
};

const kmain = fn () noreturn {
    // kernel kernel kernel
};

POSIX startcode (adapted from lib/std/start.zig):

const _start = fn callconv(.naked) () noreturn {
    if (builtin.os.tag == .wasi) {
        std.os.wasi.proc_exit(@call(.{ .modifier = .always_inline }, callMain, .{}));
    }

    asm {
        "?reg" stack_ptr: [*]usize,
    // Much more compact and local
    } switch (builtin.arch) {
        .x86_64 => "mov ?(stack_ptr), rsp",
        .i386 => "mov ?(stack_ptr), esp",
        .aarch64, .aarch64_be, .arm => "mov ?(stack_ptr), sp",
        .riscv64 => "mv ?(stack_ptr), sp",
        .mips, .mipsel => (
          \\ .set noat
          \\ move ?(stack_ptr), $sp
        ),
        else => @compileError("unsupported arch"),
    }
    // By the time we get here, we have the stack pointer
    // -- so, no global required
    : @call(.{ .modifier = .never_inline }, posixCallMainAndExit, .{ stack_ptr });
};


lerno · 2021-08-08T21:51:46Z

Compared to GCC style syntax this is much more verbose. So would it really make sense to do this?

What people really like is MSVC style, where clobbers, register allocation etc are mostly inferred by the compiler. This requires lots of good defaults, but is great to work with: downside is a lot of loss of control, which could be regained with some clever added optional constraints instead. That said it might violate the Zig explicitness goal.

So if the ideal isn't achievable, why pick a style that is completely new for Zig, rather than the de facto standard?

lerno · 2021-08-09T10:03:34Z

Also, everything is still strings, so it's basically leaving everything stringly typed. Compare that to the Rust asm, which at least defines what look like constants for things like xmm_reg, so that it behaves as an actual constant rather than a random string. Similarly, registers in this proposal are also just strings.

I suspect people will just consider this a confusing, hobbled and verbose version of GCC inline asm.

Compare this:

return asm {
        "eax": u32 = 0,
        "ebx" b: u32,
        "ecx" c: u32,
        "edx" d: u32,
        "?memory",
    } "cpuid"
    : .{ b, c, d };

To

static inline void cpuid(int code, uint32_t* a, uint32_t* d)
{
    asm volatile ( "cpuid" : "=a"(*a), "=d"(*d) : "0"(code) : "ebx", "ecx" );
}

You don't like that? Here's another one using code = 1 without the memory:

int a = 0x1, b, c, d;
asm ( "cpuid"  : "=a" (a), "=b" (b), "=c" (c), "=d" (d) : "0" (a) );

So what is the benefit?

lerno · 2021-08-09T10:04:55Z

For comparison, the Rust new asm: https://doc.rust-lang.org/beta/unstable-book/library-features/asm.html

N00byEdge · 2022-02-01T22:59:25Z

I like this syntax. A lot. But there is one issue here that just looks a little weird to me:

const _start = fn callconv(.naked) () noreturn {
    if (builtin.os.tag == .wasi) {
        std.os.wasi.proc_exit(@call(.{ .modifier = .always_inline }, callMain, .{}));
    }

    asm {
        "?reg" stack_ptr: [*]usize,
    // Much more compact and local
    } switch (builtin.arch) {
        .x86_64 => "mov ?(stack_ptr), rsp",
        .i386 => "mov ?(stack_ptr), esp",
        .aarch64, .aarch64_be, .arm => "mov ?(stack_ptr), sp",
        .riscv64 => "mv ?(stack_ptr), sp",
        .mips, .mipsel => (
          \\ .set noat
          \\ move ?(stack_ptr), $sp
        ),
        else => @compileError("unsupported arch"),
    }
    // By the time we get here, we have the stack pointer
    // -- so, no global required
    : @call(.{ .modifier = .never_inline }, posixCallMainAndExit, .{ stack_ptr });
};

In here, I just want to grab the value of a register. I don't care about what the mov instruction looks like, and I believe that should be left to the compiler to figure out. Should putting empty asm and replacing "?reg" with a switch on the arch returning "rsp" etc be allowed instead?

ghost · 2022-02-01T23:05:11Z

Hmm, I hadn’t thought of that. My instinct is to allow this, but I’m not sure if this would lead to parsing ambiguity. If not, then sure.

ethindp · 2023-07-28T21:00:33Z

I have an alternative proposal that, I think, will be much clearer, and far different from GCC inline asm, or any other asm syntax I've seen. And it won't be just string hackery, either. I imagine this proposal will take a long time to actually implement, but it'll be much, much clearer, and very elegant, and fits Zig's Zen (whereas the current proposal doesn't).

The general idea is asm blocks. The syntax is similar, but with some significant differences:

As is currently done, asm blocks begin with the keyword asm, followed by the optional keyword volatile. Then comes either:
- a parenthesized list of semicolon-separated input, output, and clobber specifications followed by a block; or
- a block.

The specifications follow these rules:
- Inputs, outputs, and clobbers are specified before any assembly statements. (I use "assembly statements" here deliberately, see below.) Input specifications take the form inputs: element1, element2, element3, ...; output and clobber specifications are similar but use the keyword outputs or clobbers, respectively.
- Input, output, and clobber specifications take input, output, or clobber elements. An element can be either of the following:
  - the form arg = value, where arg can be a register or variable, and value can be a register, a variable, or 'memory'; or
  - the form value, which can be a register or variable, or, in the case of clobbers, 'memory'.
- In the second form, the (actual) value is the only way to refer to said value in assembly statements; the first form could be considered a "renaming" of the item.
- The input/output/clobber specifications are optional, but if parentheses come before the block, at least one of those specifications must be provided. Each kind of specification may appear at most once in any given asm block.
- If the input or output is a compound data type (array, slice, struct, union, ...), that entire compound data type is considered as the input or output; you cannot use a constituent field or element of that data type on its own as an input or output.

The block contains assembly statements. An assembly statement can be either an assignment statement or an instruction:
- In the case of an assignment statement, the form is a = b;, just as in Zig. However, assignments must be "split" assignments; that is, you cannot do a[2..5] = 3;. This is because assignment statements are directly translated into loads and stores, and this version of the syntax doesn't allow for multi-loads and stores because that would be quite complex depending on the architecture, and this proposal is already complex as is. Perhaps we can change this in the future, but I think this is an acceptable limitation for now.
- The LHS of an assignment statement must be a register, a dereference, or a constituent element or field of a compound data type. You cannot write to arbitrary memory using this construction; for that, you have to use actual instructions. This is mainly because allowing arbitrary address writes and reads would look quite odd (at least in my opinion), and the usual way of doing this is to load the address into a register and then write to it that way.
- In the case of instructions, these look like function calls. This sticks to Zig's "favor reading code over writing code" ideal, and also makes things easier for people who aren't experts with inline assembly. For example, the instruction vpcmpltud k3, ymm3, ymm0 would be translated into vpcmpltud(k3, ymm3, ymm0);. Similarly, the ARM instruction LDR r0, [r1] would be translated into ldr(r0, &r1);. (I'm unsure how to translate an instruction like STMFD sp!, {r0-r3, lr} into this syntax, and would appreciate assistance to refine it.)
- Labels have the same syntax as in Zig; the same goes for referring to them in assembly statements (e.g. jmp :do_something).
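To illustrate the "instructions look like function calls" idea, here is a small Python sketch that lowers one such statement back into conventional assembly text. call_to_instruction is a hypothetical helper; it only handles flat operand lists and the &reg dereference form mentioned above, and does nothing for the open cases such as register ranges.

```python
import re

# One call-style statement: mnemonic(arg, arg, ...);
_CALL = re.compile(r"^\s*(\w+)\s*\((.*)\)\s*;\s*$")


def call_to_instruction(statement):
    """Turn 'ldr(r0, &r1);' into 'ldr r0, [r1]'. Illustrative only."""
    match = _CALL.match(statement)
    if match is None:
        raise ValueError(f"not a call-style instruction: {statement!r}")
    mnemonic, args = match.group(1), match.group(2)
    operands = []
    for arg in (a.strip() for a in args.split(",")):
        if not arg:
            continue  # no operands, e.g. cpuid();
        if arg.startswith("&"):
            # '&reg' stands for a memory operand addressed by reg.
            arg = "[" + arg[1:] + "]"
        operands.append(arg)
    return (mnemonic + " " + ", ".join(operands)).strip()
```

The two examples from the text round-trip as expected: vpcmpltud(k3, ymm3, ymm0); becomes vpcmpltud k3, ymm3, ymm0, and ldr(r0, &r1); becomes ldr r0, [r1].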

To provide an example in action, here's the classic CPUID on x86, from Agner Fog's asmlib library, which uses a parameter as a return value (but for this example we just allocate it on the stack and use that). The original example is as follows:

cpuid_ex:
%IFDEF   WINDOWS
; parameters: rcx = abcd, edx = a, r8d = c
        push    rbx
        xchg    rcx, r8
        mov     eax, edx
        cpuid                          ; input eax, ecx. output eax, ebx, ecx, edx
        mov     [r8],    eax
        mov     [r8+4],  ebx
        mov     [r8+8],  ecx
        mov     [r8+12], edx
        pop     rbx
%ENDIF        
%IFDEF   UNIX
; parameters: rdi = abcd, esi = a, edx = c
        push    rbx
        mov     eax, esi
        mov     ecx, edx
        cpuid                          ; input eax, ecx. output eax, ebx, ecx, edx
        mov     [rdi],    eax
        mov     [rdi+4],  ebx
        mov     [rdi+8],  ecx
        mov     [rdi+12], edx
        pop     rbx
%ENDIF        
        ret

We'll drop the prologue, and the example in this proposed syntax becomes:

fn cpuid(a: u32, c: u32) [4]u32 {
    var abcd: [4]u32 = undefined;
    asm(inputs: a, c; outputs: abcd; clobbers: eax, ebx, ecx, edx) {
        // Load the leaf and subleaf, mirroring `mov eax, esi` / `mov ecx, edx` above
        eax = a;
        ecx = c;
        cpuid();
        // These are memory-based movs
        abcd[0] = eax;
        abcd[1] = ebx;
        abcd[2] = ecx;
        abcd[3] = edx;
    }
    return abcd;
}

As another example, take a more complex one, loading the GDT (sorry if this isn't quite valid, I'm not the most skilled at this):

Original:

load_gdt:
    push %rbp
    mov %rsp, %rbp
    sub $32, %rsp
    mov 8(%rsp), %rax
    lgdt (%rax)
    pushq $0x08
    lea reload_segment_regs(%rip), %rax
    push %rax
    lretq
reload_segment_regs:
    mov $0x10, %ax
    mov %ax, %ds
    mov %ax, %es
    mov %ax, %fs
    mov %ax, %gs
    mov %ax, %ss
    mov %rbp, %rsp
    pop %rbp
    ret

In this syntax, this becomes:

fn load_gdt(gdt: usize) void {
    asm(inputs: gdt) {
        rax = gdt;
        lgdt(&rax);
        push(0x08);
        lea(rax, :reload_segment_regs); // labels are always PIC/PIE unless `build.zig` explicitly indicates that the executable is not position independent
        push(rax);
        lret(); // long return
    reload_segment_regs:
        // Register-immediate load
        ax = 0x10;
        // register-register load and store
        ds = ax;
        es = ax;
        fs = ax;
        gs = ax;
        ss = ax;
    }
}

Like I said, this definitely needs refinement and I think that this will take a long time to completely implement. However, I think that this is, most likely, the proposal that upholds Zig's zen and doesn't make inline assembly look like a complete and utter mess. This syntax has the benefit of giving the compiler a lot of information about what you're trying to do, so it could very well optimize your loads/stores into something using AVX or NEON if possible. What I'm unsure about are things like:

explicit pointer size indicators (e.g. dword ptr)
instruction prefixes (rex64, rex.w, etc.) (though we could perhaps make these just another function call)
ARM ranged loads/stores
AVX masking and broadcasting (e.g. vmovdqu8 zmm16{k1}{z}, [rsi] or vpaddd ymm4 {k2}, ymm4, dword ptr [ADD1] {1to8})
Memory offset (mov rbx, [rax + 0x32] for instance) (though maybe we could do mov(rbx, &(rax op...));?)

If you guys want to help refine this I'd appreciate it. I know that some of the syntactic elements that this introduces are unorthodox, and are quite different from Zig's normal syntax, but I did try to stay as close to Zig as possible while compromising on the fact that this was inline assembly and I didn't really have much of a choice. For the RHS of an assignment statement, most valid expressions are allowed, barring multi-loads or stores; I was thinking that you could even call built-ins as well. When this happened, the load/store would be a multi-load/store, but would finish as a single load/store; for example, if you used xmm0 = @sqrt(...), and @sqrt resulted in the VSQRTSD instruction, the compiler would translate your code appropriately to execute VSQRTSD and then would do a final load into xmm0. Conversely, if it resulted in a libc function call, the compiler would issue the appropriate instructions for a function call, then attempt to store the result in xmm0; if it couldn't, an error would result.

I understand that this syntax would result in "behind your back" instructions in certain instances. In the case of the aforementioned built-in function call idea, discarding that for now would be perfectly reasonable. The assignment statement thing was to eliminate the minutiae of a ton of movs, or whatever the target uses for loads and stores, and to instead allow the programmer to focus on what they really are using inline assembly to accomplish. Obviously, if you really wanted to, you could fall back to setting everything up yourself; this syntax does not prevent you from using any instruction that the target supports, even if there is a more "abstract" syntax available.

lerno · 2023-07-31T00:22:36Z

@ethindp You might want to draw some inspiration from how C3 does it: https://c3-lang.org/asm/ It creates a very simple, regular grammar and infers clobbers.

ethindp · 2023-07-31T00:41:37Z

@lerno That's an interesting syntax, but IMO it's not as clear as mine (but mine is more complex since I'm trying to be as flexible as possible).

lerno · 2023-07-31T08:21:03Z

@ethindp Yes, the focus is trying to be as cheap as possible to implement for various variants of asm.

ghost mentioned this issue May 1, 2020

Proposal: Access pre-mangled program symbols from inline assembly #5211

Closed

daurnimator added the proposal This issue suggests modifications. If it also has the "accepted" label then it is planned. label May 2, 2020

ghost changed the title ~~Enhancement: New Inline Asm Syntax~~ Enhancement: New Inline Assembly Syntax May 3, 2020

This was referenced May 4, 2020

Proposal: Pragmas #5239

Closed

inline assembly improvements #215

Open

Vexu added this to the 0.7.0 milestone May 6, 2020

ghost mentioned this issue Oct 18, 2020

Bonus Points: The "I'll F**king Show Them" Bootstrapping Plan #6723

Closed

andrewrk modified the milestones: 0.7.0, 0.8.0 Oct 27, 2020

ghost mentioned this issue Nov 11, 2020

Proposal: Generalise SIMD to arbitrary tensor types, remove footguns in vector syntax #7076

Closed

ghost mentioned this issue Dec 18, 2020

Better documentation example for inline assembly. #7488

Open

ghost mentioned this issue Dec 27, 2020

Native Assembler: Improvements, Tweaks, Enhancements #7561

Open

ghost changed the title ~~Enhancement: New Inline Assembly Syntax~~ Enhancement: New Inline Assembly Jan 24, 2021

ghost mentioned this issue Jan 30, 2021

inline branches (asm goto) support. #2085

Open

andrewrk modified the milestones: 0.8.0, 0.9.0 May 19, 2021

andrewrk modified the milestones: 0.9.0, 0.10.0 Nov 23, 2021

ghost mentioned this issue Feb 1, 2022

parse inline assembly syntax according to a set of dialects; integrate inline assembly more closely with the zig language #10761

Open

andrewrk modified the milestones: 0.10.0, 0.11.0 Apr 16, 2022

andrewrk modified the milestones: 0.11.0, 0.12.0 Apr 9, 2023

andrewrk modified the milestones: 0.13.0, 0.12.0 Jul 9, 2023
