
Proposal: Function multi-versioning #1018

Open
bheads opened this issue May 17, 2018 · 24 comments
Labels
accepted This proposal is planned. proposal This issue suggests modifications. If it also has the "accepted" label then it is planned.
Comments

@bheads

bheads commented May 17, 2018

A really interesting concept is function multi-versioning. The general idea is to support implementing multiple versions of a function for different hardware and having the correct version of the function selected at run time. Made up sample code:

pub fn someMathFunction(vec: Vector) Vector [target: sse4.2] {
    // optimized for SSE 4.2
}

pub fn someMathFunction(vec: Vector) Vector [target: avx2] {
    // optimized for AVX2
}

pub fn someMathFunction(vec: Vector) Vector [target: default] {
    // no asm/intrinsics optimization
}

// later on

const v = giveMeAVect();
const v2 = someMathFunction(v); // calls the best version based on run time selection

There are ways to simulate this using function pointers, but the compiler would be better at optimizing this, plus implementing that over and over by hand would suck.

LLVM https://llvm.org/docs/LangRef.html#ifuncs
GCC https://lwn.net/Articles/691932/

@andrewrk andrewrk added the proposal This issue suggests modifications. If it also has the "accepted" label then it is planned. label May 17, 2018
@andrewrk andrewrk added this to the 0.4.0 milestone May 17, 2018
@andrewrk
Member

This use case is in scope of Zig. Whether we use LLVM IFuncs or something else is still up for research, as well as the syntax and how it fits in with comptime and other features.

@bnoordhuis
Contributor

Ifuncs only work on ELF platforms (and not all of them either) so Zig would need a fallback for platforms where they're not supported, such as Windows, FreeBSD < 12, etc.

I'd attack this bottom-up: start with a generic solution and later on add ifuncs as a compiler optimization.

@PavelVozenilek

  1. In the build system allow targets with different properties (SSE/cache size/main memory size/GPU/...).
  2. Allow specialized constants/data structures/functions by these properties.
  3. Generate "fat executable" where all targets are generated into one file, and the most fitting target gets selected at the startup.
  4. For very specialized cases, allow offering a custom selection mechanism (e.g. running a benchmark at startup).

Using hidden function pointers set by the compiler for individual functions would hurt cache friendliness and destroy the potential for inlining.

@binary132

binary132 commented Jun 15, 2018

Edit: Never mind. Thanks for clarifying @tiehuis!

The way this is solved in Go is with a build tag at the file level. When the build runs, some build properties are defined depending on the platform and optional build arguments. The entire file is then either included or excluded depending on its build tags. Build tags such as 386,!darwin are supported, which means the file will be built only for 386 AND NOT darwin, etc.

This is simple to use, extensible with custom tags, and forces the user to put all platform-specific stuff in files having a tag immediately at the top, so there's no custom build stuff intermixed with normal code.

@tiehuis
Member

tiehuis commented Jun 15, 2018

@binary132 That is different to the issue here since it is a compile-time selection. This is possible in Zig right now by using standard if statements with comptime values. This issue is about function selection at runtime, based on runtime cpu feature detection.

This is useful for shipping a single portable binary that runs across different CPU generations (e.g. a Core 2 Duo vs. a Skylake i7), while still allowing cpu-specific performance optimizations for hot functions.

@0joshuaolson1

function selection at runtime, based on runtime cpu feature detection

Or when using Vulkan? 😏

@bronze1man

bronze1man commented Jul 6, 2018

I propose the following way to solve this, which keeps the code easier to read and write:

pub fn someMathFunction(vec: Vector) Vector {
    @useLLVMIFuncs
    if (builtin.hasSse42) {
      ...
    } else if (builtin.hasAvx2) {
      ...
    } else {
      ...
    }
}

And the compiler can generate 3 versions of this function and generate the LLVM IFuncs based on that information.

Solving multi-versioning the way Go does (build tags: https://golang.org/pkg/go/build/#hdr-Build_Constraints) makes the code difficult to read, write, and refactor. (Did I define all the versions I need? Do I have to rename every version of the function? Where is the Linux version of that function?)

@tiehuis
Member

tiehuis commented Jul 7, 2018

I think the switch as specified by @bronze1man seems most in line with how Zig would work, given it uses all standard syntax. Ignoring the ifunc optimization, we first need CPU feature detection of some sort. I've implemented some example code below which provides runtime CPU feature detection, to get a better idea of potential issues.

Example

// initialized globally somewhere, could do locally but minor cost
var cpu = CpuFeatures.init();

pub fn memchr(buf: *const u8, c: u8) usize {
    if (cpu.hasFeature(FeatureX86.avx512)) {
        return memchr_avx512(buf, c);
    } else if (cpu.hasFeature(FeatureX86.sse2)) {
        return memchr_sse2(buf, c);
    } else {
        return memchr_generic(buf, c);
    }
}

Implementation

cpuid.c

// This is done separately for now since zig's multi-return inline asm was a pain.
#include <cpuid.h>

int cpuid(unsigned int leaf, unsigned int *eax, unsigned int *ebx, unsigned int *ecx, unsigned int *edx)
{
    return __get_cpuid(leaf, eax, ebx, ecx, edx);
}

cpu.zig

const std = @import("std");
const builtin = @import("builtin");

// Runtime cpu feature detection.
//
// This is currently implemented for x86/x64 targets. For generic targets, the features
// returned will be compile-time false and will not use any code space.

comptime {
    std.debug.assert(@memberCount(FeatureX86) == 224);
}

// See https://en.wikipedia.org/wiki/CPUID
pub const FeatureX86 = enum {
    // eax = 1, output: edx
    fpu,
    vme,
    de,
    pse,
    tsc,
    msr,
    pae,
    mce,
    cx8,
    apic,
    _reserved1,
    sep,
    mtrr,
    pge,
    mca,
    cmov,
    pat,
    pse_36,
    psn,
    clfsh,
    _reserved2,
    ds,
    acpi,
    mmx,
    fxsr,
    sse,
    sse2,
    ss,
    htt,
    tm,
    ia64,
    pbe,

    // omitted remaining features for brevity
};

// Implemented in C until multi-output asm is easier. See: #215.
extern fn cpuid(leaf: c_uint, eax: *c_uint, ebx: *c_uint, ecx: *c_uint, edx: *c_uint) c_int;

pub const CpuFeatures = struct {
    buf: [7]u32,

    pub fn init() CpuFeatures {
        var self = CpuFeatures{ .buf = []u32{0} ** 7 };

        switch (builtin.arch) {
            builtin.Arch.i386, builtin.Arch.x86_64 => {
                var eax: c_uint = undefined;
                var ebx: c_uint = undefined;
                var ecx: c_uint = undefined;
                var edx: c_uint = undefined;

                // __get_cpuid validates the leaf internally, but our own
                // implementation may not. Leaf 0 reports the maximum
                // supported basic leaf in eax (not in the return value).
                _ = cpuid(0, &eax, &ebx, &ecx, &edx);
                const max_basic_cpu_leaf = eax;

                if (max_basic_cpu_leaf >= 1) {
                    if (cpuid(1, &eax, &ebx, &ecx, &edx) == 1) {
                        self.buf[0] = edx;
                        self.buf[1] = ecx;
                    }
                }

                if (max_basic_cpu_leaf >= 7) {
                    if (cpuid(7, &eax, &ebx, &ecx, &edx) == 1) {
                        self.buf[2] = ebx;
                        self.buf[3] = ecx;
                        self.buf[4] = edx;
                    }
                }

                _ = cpuid(0x80000000, &eax, &ebx, &ecx, &edx);
                const max_ext_cpu_leaf = eax; // maximum supported extended leaf

                if (max_ext_cpu_leaf >= 0x80000001) {
                    if (cpuid(0x80000001, &eax, &ebx, &ecx, &edx) == 1) {
                        self.buf[5] = edx;
                        self.buf[6] = ecx;
                    }
                }
            },
            else => {
                // other targets return false for `hasFeature`
            },
        }

        return self;
    }

    // This would actually take a var, which would accept any platform Feature, e.g.
    // FeatureX86 or FeatureARM (is that applicable?).
    //
    // If compiling for a target without this feature we know at comptime that we can never
    // execute that feature and no arch-specific code is included.
    pub inline fn hasFeature(self: *const CpuFeatures, feature: FeatureX86) bool {
        // We require #868 for allowing runtime-selected functions to be allowed to run at
        // compile-time. Assuming something like D and its __ctfe for the moment.
        if (__ctfe) {
            return false;
        }

        switch (builtin.arch) {
            builtin.Arch.i386, builtin.Arch.x86_64 => {
                const n = @enumToInt(feature);
                return (self.buf[n >> 5] & (u32(1) << @intCast(u5, n & 0x1f))) != 0;
            },
            else => {
                return false;
            },
        }
    }
};

Notes

Non-compatible architectures where the feature does not work are compile-time known which allows us to avoid compiling in incompatible branches.

For the following example:

pub fn main() void {
    const cpu = CpuFeatures.init();

    if (cpu.hasFeature(FeatureX86.sse42)) {
        @compileError("not allowed to compile to target which may have sse");
    } else {
        std.debug.warn("no sse\n");
    }
}
$ zig build-exe cpu.zig
/home/me/src/cpuid/cpu.zig:356:9: error: not allowed to compile to target which may have sse
        @compileError("not allowed to compile to target which may have sse");

$ zig build-exe cpu.zig --target-arch armv7
# all ok

We require #868 to allow for suitable fallback implementations where not otherwise specified. Unfortunately I don't think this can be avoided in this case.


@andrewrk andrewrk added the accepted This proposal is planned. label Nov 21, 2018
@emekoi
Contributor

emekoi commented Feb 4, 2019

does llvm have any cpu feature capability detection features we could leverage instead of rolling our own?

@shawnl
Contributor

shawnl commented Apr 22, 2019

does llvm have any cpu feature capability detection features we could leverage instead of rolling our own?

yes, but it is x86-specific and we already have it in the zig tree as ./c_headers/cpuid.h. For other architectures the closest equivalent may be the (Linux-specific) /proc/cpuinfo. IFUNC has yet to be ported to non-x86 architectures.

@shawnl
Contributor

shawnl commented Jun 26, 2019

We require #868 to allow for suitable fallback implementations where not otherwise specified. Unfortunately I don't think this can be avoided in this case.

We should probably model this after Linux's AT_HWCAP and AT_HWCAP2, which is not needed on the CISCy x86, but is needed on most other arches.

@daurnimator
Contributor

See also the GCC target_clones function attribute: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-target_005fclones-function-attribute

The target_clones attribute is used to specify that a function be cloned into multiple versions compiled with different target options than specified on the command line. The supported options and restrictions are the same as for target attribute.

For instance, on an x86, you could compile a function with target_clones("sse4.1,avx"). GCC creates two function clones, one compiled with -msse4.1 and another with -mavx.

On a PowerPC, you can compile a function with target_clones("cpu=power9,default"). GCC will create two function clones, one compiled with -mcpu=power9 and another with the default options. GCC must be configured to use GLIBC 2.23 or newer in order to use the target_clones attribute.

It also creates a resolver function (see the ifunc attribute above) that dynamically selects a clone suitable for current architecture. The resolver is created only if there is a usage of a function with target_clones attribute.

Note that any subsequent call of a function without target_clones from a target_clones caller will not lead to copying (target clone) of the called function. If you want to enforce such behaviour, we recommend declaring the calling function with the flatten attribute.

@andrewrk andrewrk added this to the 0.9.0 milestone May 19, 2021
@andrewrk andrewrk modified the milestones: 0.9.0, 0.10.0 Nov 20, 2021
@andrewrk andrewrk modified the milestones: 0.10.0, 0.11.0 Apr 16, 2022
@zxubian

zxubian commented Oct 30, 2022

I think there should be a way to disable this at compile-time.

The use cases discussed so far assume you don't know the specifics of your target CPU. However, there are cases when you write code knowing exactly what hardware it will run on.

If the standard library starts using function multi-versioning, yet you know at compile time that you will only need one specific version, you should be able to select that version at compile time and make sure that no others make it to the binary.

Otherwise, you'll waste executable size (and potentially cache penalties if this is implemented via pointers).

@Jarred-Sumner
Contributor

This has been a big pain point for Bun.

Our current solution is to compile two versions of bun for Linux x64 and macOS x64. One which targets haswell and one which targets westmere. We use an install script that detects if the computer supports AVX2 and chooses the right binary. We recompile all dependencies with the same CPU target set.

This unfortunately breaks when someone installs Bun without using the official install script that checks for this - like if using a package manager for a Linux distribution. Having to separate by CPU features makes it more difficult for people to distribute Bun.

@mitchellh
Sponsor Contributor

mitchellh commented Nov 14, 2022

@Jarred-Sumner I'm interested in this issue too, but if you have certain specialized hot paths you can do runtime SIMD detection to avoid having to build two binaries. It's a bit of work, but it isn't too bad. You have to abandon all of the Zig intrinsics and write raw assembly, but I've been doing it for some SIMD work and it's been going along fine. If you're trying to get the compiler to auto-vectorize into special instructions, though, then yeah, this is a pain 😄 Ping me on Discord if you need help; I have some example code I can share with you.

Edit: here is how simdjson does ISA detection: https://sourcegraph.com/github.com/simdjson/simdjson/-/blob/include/simdjson/internal/isadetection.h Pair something like that with a Zig dynamic dispatch interface (ptr + vtable) and it's pretty much just as fast if the interface is right.

@daurnimator
Contributor

Edit: here is how simdjson does ISA detection: https://sourcegraph.com/github.com/simdjson/simdjson/-/blob/include/simdjson/internal/isadetection.h

FYI zig already has cpuid code in the standard library

pub fn detectNativeCpuAndFeatures(arch: Target.Cpu.Arch, os: Target.Os, cross_target: CrossTarget) Target.Cpu {

Though only the high level function is pub.

@mitchellh
Sponsor Contributor

Helpful! It's much faster to just pull out the bits you care about than to make a generalized "parse all features of cpuid" library. I've always found it easy to do the cpuid stuff myself and do only the bit manipulation at the start of the program that I'll need for the rest of it.

@andrewrk andrewrk modified the milestones: 0.11.0, 0.12.0 Apr 9, 2023
@ominitay
Contributor

When implemented, this feature should probably allow selecting based not only on CPU features, but on the CPU family too. An example of the utility of this would be how AMD's processors implemented the BMI2 PDEP/PEXT instructions in microcode until Zen 3. This made the instructions incredibly slow, and therefore unfit for purpose. A project would likely want to treat this case as-if the instructions weren't available at all.

@andrewrk andrewrk modified the milestones: 0.13.0, 0.12.0 Jun 29, 2023
@ghost

ghost commented Jul 14, 2023

This is such an extremely obscure thing to want that I think it makes perfect sense to require implementing it manually via function pointers.

@nektro
Contributor

nektro commented Jul 14, 2023

better yet,

pub fn someMathFunction(vec: Vector) Vector {
    if (builtin.cpu.arch == .x86_64) {
        if (comptime std.Target.x86.featureSetHas(builtin.cpu.features, .sse4_2)) {
            // ...
            // optimized for SSE 4.2
            //
            return;
        }
        if (comptime std.Target.x86.featureSetHas(builtin.cpu.features, .avx2)) {
            // ...
            // optimized for avx2
            //
            return;
        }
        // ...
        // no asm/intrinsics optimization
        //
    }
}

@mitchellh
Sponsor Contributor

@nektro: better yet,

For this original issue, this might work, but I just want to note that this won't work for the imo more useful and general case of trying to compile software for a baseline target for whatever arch and runtime detecting feature sets. The builtin features are only going to have it if you compile with a target that has those set.

In my use case (which I don't think this issue is trying to solve and function pointers are probably the way to go, being very clear!), I want to be able to build and package my software such that it works on any x86_64 arch but can runtime detect supported features.

I end up writing something that looks like this:

/// This is purposely incomplete, just an example for comment.
const ISA = enum {
    generic,
    neon,
    avx2,

    /// Detect the ISA to use using compile-time information as well
    /// as runtime information (i.e. cpuid).
    pub fn detect() !ISA {
        return switch (builtin.cpu.arch) {
            // Neon is mandatory on aarch64. No runtime checks necessary.
            .aarch64 => .neon,

            // X86 we have to call out to cpuid
            // Note: I don't have .x86/.i386 here because there was a
            // recent change to the name of the enum. But, in general,
            // you'd have x86 here too.
            inline .x86_64 => detectX86(),

            // Unknown, assume generic
            else => .generic,
        };
    }

    fn detectX86() ISA {
        // Magic constants below come from Intel ISA Vol 2A 3-218
        const id = cpuid.initEx(7, 0);
        if (id.ebx & (@as(c_uint, 1) << 5) > 0) return .avx2;
        return .generic;
    }
};

@andrewrk
Member

To be clear @mitchellh your use case is exactly what this issue is intended to solve.

Example would be swapping out memcpy on program startup to an implementation that takes full advantage of the CPU features detected at runtime.

The main challenge is doing so in a way that does not cause function calls to become virtual and thereby compromise perf which is the main purpose of this feature.

@sharpobject
Contributor

sharpobject commented Jan 2, 2024

This is such an extremely obscure thing to want that I think it makes perfect sense to require implementing it manually via function pointers.

The reason this wouldn't be great is that users of this feature probably want at least some monomorphization of callers of multi-versioned functions. @andrewrk gave an example of memcpy, which should probably both be inlined into the caller and be a different implementation depending on the CPU.

It's also fairly difficult to do it manually right now. Within your own code, you can pass around a comptime argument that specifies the level of ISA extensions to specialize to, but if you end up using libraries that take callbacks that do not pass back a user-provided comptime ctx, then you lose this information at library boundaries. edit: I guess this is actually doable by putting your callback function in a struct that has a member that specifies the level of extensions. But this will only get you specialization of your own SIMD kernels, not memcpy or similar.

I spitballed that one could try implementing multi-versioning manually by mutating a comptime global (does this even work?) but that also breaks because comptime function evaluations are cached based on the values of their arguments, not the values of their arguments plus all the globals they read or could read.

@InKryption
Contributor

I spitballed that one could try this by mutating a comptime global (does this even work?)

No, global mutable comptime memory is explicitly disallowed, as per #7396.
