
Proposal: Function multi-versioning #1018

Open
bheads opened this issue May 17, 2018 · 24 comments
Labels
accepted This proposal is planned. proposal This issue suggests modifications. If it also has the "accepted" label then it is planned.
Comments

@bheads

bheads commented May 17, 2018

A really interesting concept is function multi-versioning. The general idea is to support implementing multiple versions of a function for different hardware and having the correct version of the function selected at run time. Made up sample code:

pub fn someMathFunction(vec: Vector) Vector [target: sse4.2] {
    // optimized for SSE 4.2
}

pub fn someMathFunction(vec: Vector) Vector [target: avx2] {
    // optimized for AVX2
}

pub fn someMathFunction(vec: Vector) Vector [target: default] {
    // no asm/intrinsics optimization
}

// later on

const v = giveMeAVect();
const v2 = someMathFunction(v); // calls the best version based on run time selection

There are ways to simulate this using function pointers, but the compiler would be better at optimizing this, plus implementing that over and over by hand would suck.

LLVM https://llvm.org/docs/LangRef.html#ifuncs
GCC https://lwn.net/Articles/691932/

@andrewrk andrewrk added the proposal This issue suggests modifications. If it also has the "accepted" label then it is planned. label May 17, 2018
@andrewrk andrewrk added this to the 0.4.0 milestone May 17, 2018
@andrewrk
Member

This use case is in scope of Zig. Whether we use LLVM IFuncs or something else is still up for research, as well as the syntax and how it fits in with comptime and other features.

@bnoordhuis
Contributor

Ifuncs only work on ELF platforms (and not all of them either) so Zig would need a fallback for platforms where they're not supported, such as Windows, FreeBSD < 12, etc.

I'd attack this bottom-up: start with a generic solution and later on add ifuncs as a compiler optimization.

@PavelVozenilek

  1. In the build system allow targets with different properties (SSE/cache size/main memory size/GPU/...).
  2. Allow specialized constants/data structures/functions by these properties.
  3. Generate "fat executable" where all targets are generated into one file, and the most fitting target gets selected at the startup.
  4. For very specialized cases, allow offering a custom selection mechanism (e.g. running a benchmark at startup).

Using hidden function pointers set by the compiler for individual functions would hurt cache friendliness and destroy the potential for inlining.

@binary132

binary132 commented Jun 15, 2018

Edit: Never mind. Thanks for clarifying @tiehuis!

The way this is solved in Go is with a build tag at the file level. When the build runs, some build properties are defined depending on the platform and optional build arguments. The entire file is then either included or excluded depending on its build tags. Build tags such as 386,!darwin are supported, which means the file will be built only for 386 AND NOT darwin, etc.

This is simple to use, extensible with custom tags, and forces the user to put all platform-specific stuff in files having a tag immediately at the top, so there's no custom build stuff intermixed with normal code.

@tiehuis
Member

tiehuis commented Jun 15, 2018

@binary132 That is different to the issue here since it is a compile-time selection. This is possible in Zig right now by using standard if statements with comptime values. This issue is about function selection at runtime, based on runtime cpu feature detection.

This is useful for shipping a single portable binary that runs across different CPU generations (e.g. a Core 2 Duo vs. a Skylake i7), while still allowing cpu-specific performance optimizations for hot functions.

@0joshuaolson1

function selection at runtime, based on runtime cpu feature detection

Or when using Vulkan? 😏

@bronze1man

bronze1man commented Jul 6, 2018

I propose the following way to solve this, which keeps the code easier to read and write:

pub fn someMathFunction(vec: Vector) Vector {
    @useLLVMIFuncs
    if (builtin.hasSse42) {
      ...
    } else if (builtin.hasAvx2) {
      ...
    } else {
      ...
    }
}

And the compiler can generate 3 versions of this function and generate the LLVM IFuncs based on that information.

Solving multi-versioning the way Go does (build tags: https://golang.org/pkg/go/build/#hdr-Build_Constraints) makes the code difficult to read, write, and refactor. (Did I define all the versions I need? Do I have to rename every version of the function? Where is the Linux version of that function?)

@tiehuis
Member

tiehuis commented Jul 7, 2018

I think the switch as specified by @bronze1man seems most in line with how Zig would work, given it uses all standard syntax. Ignoring the ifunc optimization, we first need CPU feature detection of some sort. I've implemented some example code below which provides runtime CPU feature detection, to get a better idea of potential issues.

Example

// initialized globally somewhere, could do locally but minor cost
var cpu = CpuFeatures.init();

pub fn memchr(buf: *const u8, c: u8) usize {
    if (cpu.hasFeature(FeatureX86.avx512)) {
        return memchr_avx512(buf, c);
    } else if (cpu.hasFeature(FeatureX86.sse2)) {
        return memchr_sse2(buf, c);
    } else {
        return memchr_generic(buf, c);
    }
}

Implementation

cpuid.c

// This is done separately for now since zig's multi-return inline asm was a pain.
#include <cpuid.h>

int cpuid(unsigned int leaf, unsigned int *eax, unsigned int *ebx, unsigned int *ecx, unsigned int *edx)
{
    return __get_cpuid(leaf, eax, ebx, ecx, edx);
}

cpu.zig

const std = @import("std");
const builtin = @import("builtin");

// Runtime cpu feature detection.
//
// This is currently implemented for x86/x64 targets. For generic targets, the features
// returned will be compile-time false and will not use any code space.

comptime {
    std.debug.assert(@memberCount(FeatureX86) == 224);
}

// See https://en.wikipedia.org/wiki/CPUID
pub const FeatureX86 = enum {
    // eax = 1, output: edx
    fpu,
    vme,
    de,
    pse,
    tsc,
    msr,
    pae,
    mce,
    cx8,
    apic,
    _reserved1,
    sep,
    mtrr,
    pge,
    mca,
    cmov,
    pat,
    pse_36,
    psn,
    clfsh,
    _reserved2,
    ds,
    acpi,
    mmx,
    fxsr,
    sse,
    sse2,
    ss,
    htt,
    tm,
    ia64,
    pbe,

    // omitted remaining features for brevity
};

// Implemented in C until multi-output asm is easier. See: #215.
extern fn cpuid(leaf: c_uint, eax: *c_uint, ebx: *c_uint, ecx: *c_uint, edx: *c_uint) c_int;

pub const CpuFeatures = struct {
    buf: [7]u32,

    pub fn init() CpuFeatures {
        var self = CpuFeatures{ .buf = []u32{0} ** 7 };

        switch (builtin.arch) {
            builtin.Arch.i386, builtin.Arch.x86_64 => {
                var eax: c_uint = undefined;
                var ebx: c_uint = undefined;
                var ecx: c_uint = undefined;
                var edx: c_uint = undefined;

                // __get_cpuid validates the leaf internally, but our own
                // implementation may not. Leaf 0 reports the maximum
                // supported basic leaf in eax (not in the return value).
                _ = cpuid(0, &eax, &ebx, &ecx, &edx);
                const max_basic_cpu_leaf = eax;

                if (max_basic_cpu_leaf >= 1) {
                    if (cpuid(1, &eax, &ebx, &ecx, &edx) == 1) {
                        self.buf[0] = edx;
                        self.buf[1] = ecx;
                    }
                }

                if (max_basic_cpu_leaf >= 7) {
                    if (cpuid(7, &eax, &ebx, &ecx, &edx) == 1) {
                        self.buf[2] = ebx;
                        self.buf[3] = ecx;
                        self.buf[4] = edx;
                    }
                }

                _ = cpuid(0x80000000, &eax, &ebx, &ecx, &edx);
                const max_ext_cpu_leaf = eax; // maximum supported extended leaf

                if (max_ext_cpu_leaf >= 0x80000001) {
                    if (cpuid(0x80000001, &eax, &ebx, &ecx, &edx) == 1) {
                        self.buf[5] = edx;
                        self.buf[6] = ecx;
                    }
                }
            },
            else => {
                // other targets return false for `hasFeature`
            },
        }

        return self;
    }

    // This would actually take a var, which would accept any platform Feature, e.g.
    // FeatureX86 or FeatureARM (is that applicable?).
    //
    // If compiling for a target without this feature we know at comptime that we can never
    // execute that feature and no arch-specific code is included.
    pub inline fn hasFeature(self: *const CpuFeatures, feature: FeatureX86) bool {
        // We require #868 for allowing runtime-selected functions to be allowed to run at
        // compile-time. Assuming something like D and its __ctfe for the moment.
        if (__ctfe) {
            return false;
        }

        switch (builtin.arch) {
            builtin.Arch.i386, builtin.Arch.x86_64 => {
                const n = @enumToInt(feature);
                return (self.buf[n >> 5] & (u32(1) << @intCast(u5, n & 0x1f))) != 0;
            },
            else => {
                return false;
            },
        }
    }
};

Notes

Non-compatible architectures where the feature does not work are compile-time known which allows us to avoid compiling in incompatible branches.

For the following example:

pub fn main() void {
    const cpu = CpuFeatures.init();

    if (cpu.hasFeature(FeatureX86.sse42)) {
        @compileError("not allowed to compile to target which may have sse");
    } else {
        std.debug.warn("no sse\n");
    }
}
$ zig build-exe cpu.zig
/home/me/src/cpuid/cpu.zig:356:9: error: not allowed to compile to target which may have sse
        @compileError("not allowed to compile to target which may have sse");

$ zig build-exe cpu.zig --target-arch armv7
# all ok

We require #868 to allow for suitable fallback implementations where not otherwise specified. Unfortunately I don't think this can be avoided in this case.


@andrewrk andrewrk added the accepted This proposal is planned. label Nov 21, 2018
@emekoi
Contributor

emekoi commented Feb 4, 2019

does llvm have any cpu feature capability detection features we could leverage instead of rolling our own?

@shawnl
Contributor

shawnl commented Apr 22, 2019

does llvm have any cpu feature capability detection features we could leverage instead of rolling our own?

yes, but it is x86-specific and we already have it in the zig tree as ./c_headers/cpuid.h. For other architectures the closest equivalent may be the (Linux-specific) /proc/cpuinfo. IFUNC has yet to be ported to non-x86 architectures.

@shawnl
Contributor

shawnl commented Jun 26, 2019

We require #868 to allow for suitable fallback implementations where not otherwise specified. Unfortunately I don't think this can be avoided in this case.

We should probably model this after Linux's AT_HWCAP and AT_HWCAP2, which is not needed on the CISCy x86, but is needed on most other arches.

@daurnimator
Contributor

See also the GCC target_clones function attribute: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-target_005fclones-function-attribute

The target_clones attribute is used to specify that a function be cloned into multiple versions compiled with different target options than specified on the command line. The supported options and restrictions are the same as for target attribute.

For instance, on an x86, you could compile a function with target_clones("sse4.1,avx"). GCC creates two function clones, one compiled with -msse4.1 and another with -mavx.

On a PowerPC, you can compile a function with target_clones("cpu=power9,default"). GCC will create two function clones, one compiled with -mcpu=power9 and another with the default options. GCC must be configured to use GLIBC 2.23 or newer in order to use the target_clones attribute.

It also creates a resolver function (see the ifunc attribute above) that dynamically selects a clone suitable for current architecture. The resolver is created only if there is a usage of a function with target_clones attribute.

Note that any subsequent call of a function without target_clones from a target_clones caller will not lead to copying (target clone) of the called function. If you want to enforce such behaviour, we recommend declaring the calling function with the flatten attribute.

@andrewrk andrewrk added this to the 0.9.0 milestone May 19, 2021
@andrewrk andrewrk modified the milestones: 0.9.0, 0.10.0 Nov 20, 2021
@andrewrk andrewrk modified the milestones: 0.10.0, 0.11.0 Apr 16, 2022
@zxubian

zxubian commented Oct 30, 2022

I think there should be a way to disable this at compile-time.

The use cases discussed so far assume you don't know the specifics of your target CPU. However, there are cases when you write code knowing exactly what hardware it will run on.

If the standard library starts using function multi-versioning, yet you know at compile time that you will only need one specific version, you should be able to select that version at compile time and make sure that no others make it to the binary.

Otherwise, you'll waste executable size (and potentially cache penalties if this is implemented via pointers).

@Jarred-Sumner
Contributor

This has been a big pain point for Bun.

Our current solution is to compile two versions of bun for Linux x64 and macOS x64. One which targets haswell and one which targets westmere. We use an install script that detects if the computer supports AVX2 and chooses the right binary. We recompile all dependencies with the same CPU target set.

This unfortunately breaks when someone installs Bun without using the official install script that checks for this - like if using a package manager for a Linux distribution. Having to separate by CPU features makes it more difficult for people to distribute Bun.

@mitchellh
Sponsor Contributor

mitchellh commented Nov 14, 2022

@Jarred-Sumner I'm interested in this issue too, but if you have certain specialized hot paths you can do runtime SIMD detection to avoid having to build two binaries. It's a bit of work, but it isn't too bad. You have to abandon all of the Zig intrinsics and write raw assembly, but I've been doing it for some SIMD work and it's been going along fine. If you're trying to get the compiler to auto-vectorize into special instructions, though, then yeah, this is a pain 😄 Ping me on Discord if you need help; I have some example code I can share with you.

Edit: here is how simdjson does ISA detection: https://sourcegraph.com/github.com/simdjson/simdjson/-/blob/include/simdjson/internal/isadetection.h Pair something like that with a Zig dynamic dispatch interface (ptr + vtable) and it's pretty much just as fast if the interface is right.

@daurnimator
Contributor

Edit: here is how simdjson does ISA detection: https://sourcegraph.com/github.com/simdjson/simdjson/-/blob/include/simdjson/internal/isadetection.h

FYI zig already has cpuid code in the standard library

pub fn detectNativeCpuAndFeatures(arch: Target.Cpu.Arch, os: Target.Os, cross_target: CrossTarget) Target.Cpu {

Though only the high level function is pub.

@mitchellh
Sponsor Contributor

Helpful! It's much faster to just pull out the bits you care about than to make a generalized "parse all features of cpuid" library. I've always found it easy to do the cpuid stuff myself and do only the bit manipulation at the start of the program that I'll need for the rest of it.

@andrewrk andrewrk modified the milestones: 0.11.0, 0.12.0 Apr 9, 2023
@ominitay
Contributor

When implemented, this feature should probably allow selecting based not only on CPU features, but on the CPU family too. An example of the utility of this would be how AMD's processors implemented the BMI2 PDEP/PEXT instructions in microcode until Zen 3. This made the instructions incredibly slow, and therefore unfit for purpose. A project would likely want to treat this case as-if the instructions weren't available at all.

@andrewrk andrewrk modified the milestones: 0.13.0, 0.12.0 Jun 29, 2023
@ghost

ghost commented Jul 14, 2023

This is such an extremely obscure thing to want that I think it makes perfect sense to require implementing it manually via function pointers.

@nektro
Contributor

nektro commented Jul 14, 2023

better yet,

pub fn someMathFunction(vec: Vector) Vector {
    if (builtin.cpu.arch == .x86_64) {
        if (comptime std.Target.x86.featureSetHas(builtin.cpu.features, .sse4_2)) {
            // ...
            // optimized for SSE 4.2
            //
            return;
        }
        if (comptime std.Target.x86.featureSetHas(builtin.cpu.features, .avx2)) {
            // ...
            // optimized for avx2
            //
            return;
        }
        // ...
        // no asm/intrinsics optimization
        //
    }
}

@mitchellh
Sponsor Contributor

@nektro: better yet,

For this original issue, this might work, but I just want to note that this won't work for the imo more useful and general case of trying to compile software for a baseline target for whatever arch and runtime detecting feature sets. The builtin features are only going to have it if you compile with a target that has those set.

In my use case (which I don't think this issue is trying to solve and function pointers are probably the way to go, being very clear!), I want to be able to build and package my software such that it works on any x86_64 arch but can runtime detect supported features.

I end up writing something that looks like this:

/// This is purposely incomplete, just an example for comment.
const ISA = enum {
    generic,
    neon,
    avx2,

    /// Detect the ISA to use using compile-time information as well
    /// as runtime information (i.e. cpuid).
    pub fn detect() !ISA {
        return switch (builtin.cpu.arch) {
            // Neon is mandatory on aarch64. No runtime checks necessary.
            .aarch64 => .neon,

            // X86 we have to call out to cpuid
            // Note: I don't have .x86/.i386 here because there was a
            // recent change to the name of the enum. But, in general,
            // you'd have x86 here too.
            inline .x86_64 => detectX86(),

            // Unknown, assume generic
            else => .generic,
        };
    }

    fn detectX86() ISA {
        // Magic constants below come from Intel ISA Vol 2A 3-218
        const id = cpuid.initEx(7, 0);
        if (id.ebx & (@as(c_uint, 1) << 5) > 0) return .avx2;
        return .generic;
    }
};

@andrewrk
Member

To be clear @mitchellh your use case is exactly what this issue is intended to solve.

Example would be swapping out memcpy on program startup to an implementation that takes full advantage of the CPU features detected at runtime.

The main challenge is doing so in a way that does not cause function calls to become virtual and thereby compromise perf which is the main purpose of this feature.

@sharpobject
Contributor

sharpobject commented Jan 2, 2024

This is such an extremely obscure thing to want that I think it makes perfect sense to require implementing it manually via function pointers.

The reason this wouldn't be great is that users of this feature probably want at least some monomorphization of callers of multi-versioned functions. @andrewrk gave an example of memcpy, which should probably both be inlined into the caller and be a different implementation depending on the CPU.

It's also fairly difficult to do it manually right now. Within your own code, you can pass around a comptime argument that specifies the level of ISA extensions to specialize to, but if you end up using libraries that take callbacks that do not pass back a user-provided comptime ctx, then you lose this information at library boundaries. edit: I guess this is actually doable by putting your callback function in a struct that has a member that specifies the level of extensions. But this will only get you specialization of your own SIMD kernels, not memcpy or similar.

I spitballed that one could try implementing multi-versioning manually by mutating a comptime global (does this even work?) but that also breaks because comptime function evaluations are cached based on the values of their arguments, not the values of their arguments plus all the globals they read or could read.

@InKryption
Contributor

I spitballed that one could try this by mutating a comptime global (does this even work?)

No, global mutable comptime memory is explicitly disallowed, as per #7396.
