
Handling of heterogeneous computing #10397

Open
gwenzek opened this issue Dec 23, 2021 · 3 comments
Labels
use case Describes a real use case that is difficult or impossible, but does not propose a solution.
Comments


gwenzek commented Dec 23, 2021

This issue is a follow-up on my own issue/experience working to add NVPTX (Nvidia GPU) support to Zig stage2 (#10189) and @Snektron's ongoing work on the SPIR-V backend (#2683), also stage2. The same considerations also apply to other heterogeneous programming targets like FPGAs.

One of the issues with these "targets" is that they aren't standalone: typically these devices are driven by a more conventional CPU, the host. The host code first pushes binary code to the device, then asks the device to execute some functions from this binary on runtime-known inputs. This creates issues because we need to generate binaries that are consistent across two different targets. Notably: which types are allowed in the host/device interface?

Main questions

So the questions I'd like to talk about are:

  • Should Zig handle heterogeneous programming? (I think yes, given the approval of SPIR-V target? #2683 and the reactions to my previous work on this.)
  • Should we keep only the C ABI between host and device, or allow arbitrary Zig structs? (I'm arguing for Zig structs.)
  • In general, what level of integration between host and device code can be expected?
  • Should this integration live in the Zig compiler or in libraries?
  • What can the Zig compiler do to help the libraries?

Personally I'd be quite happy if Zig could provide minimal support for ergonomic heterogeneous programming (not constrained by the C ABI), even if there are quirks and only some combinations of platforms are supported, so that we can learn more about the issues and later try to iron them out.


gwenzek commented Dec 23, 2021

In this message I describe how we can provide support for ergonomic heterogeneous programming,
looking at two implementations: Clang and my own Cudaz/Stage 2 fork.

Clang already provides support for heterogeneous programming and imposes no constraints on what types can be used at the interface between host and device. Let's look at how it handles this (the schema shows SPIR-V, but this matches what clang does when targeting NVPTX).

[Figure: DPC++ compiler and runtime architecture, from https://intel.github.io/llvm-docs/CompilerAndRuntimeDesign.html#dpc-compiler-architecture]

The idea is that the compiler:

  1. extracts the "device" code from a regular source file
  2. compiles it to the device binary format
  3. generates a host object file that embeds the device binary and contains minimal functions to push the binary to the device and to call the device code from the host (using an offload bundler)

The advantage of this approach is that the object files are standalone and the device code is hidden; in particular, the ABI between host and device isn't exposed.
The downside is that it makes the compiler architecture more complex. That's probably not a big issue for clang, but it may be for Zig, which has less manpower.

Now let's look at the approach I used for Cudaz, the library I'm building for Nvidia GPU.

  • I put all my device code into one .zig file (which can import other files): hw5_kernel.zig.
    All functions marked pub export are the kernels, i.e. device entry points visible to the host.
  • I compile the device code with a standard zig build-obj -target nvptx64-cuda into a "hw5_kernel.ptx" file (this is uglier than needed because I'm mixing Stage1/Stage2 compilation).
  • The output file "hw5_kernel.ptx" is passed through build options to the host compilation and embedded into the host object.
  • I also import the device code into the host code, hw5.zig.
  • With access to the signatures of the device functions, the host wrapper code can be offloaded to a library using comptime reflection.
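A minimal sketch of what that comptime reflection could look like (the Stream and LaunchConfig types and the driver plumbing are hypothetical stand-ins, not the actual Cudaz API):

```zig
const std = @import("std");
const kernels = @import("hw5_kernel.zig");

// Hypothetical handle types standing in for the real Cudaz ones.
const Stream = opaque {};
const LaunchConfig = struct { grid: [3]u32, block: [3]u32 };

/// Build a typed host-side launcher from a device function's signature.
fn Launcher(comptime kernel_name: []const u8) type {
    const kernel = @field(kernels, kernel_name);
    const Args = std.meta.ArgsTuple(@TypeOf(kernel));
    return struct {
        pub fn launch(stream: *Stream, cfg: LaunchConfig, args: Args) !void {
            // The CUDA driver API takes kernel arguments as an array of
            // opaque pointers, one per argument.
            var ptrs: [std.meta.fields(Args).len]*const anyopaque = undefined;
            inline for (std.meta.fields(Args), 0..) |f, i| {
                ptrs[i] = &@field(args, f.name);
            }
            // ... look up the symbol `kernel_name` in the loaded .ptx
            // module and hand `ptrs` + `cfg` to cuLaunchKernel here.
            _ = stream;
            _ = cfg;
        }
    };
}
```

Because the argument tuple type is derived from the kernel's real signature, the host gets a fully type-checked call site without hand-written wrappers.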

This approach is simpler in the sense that it mostly happens in user land, using Zig's killer features: comptime and build.zig. But it does have some quirks:

  1. The generated .ptx file relies on the internal Zig ABI and calling convention, so trying to call the kernels from a C library won't work.
    Note that this is a user choice: I could have written kernels that only accept *u8 and u32 and generated a universal ".ptx".

  2. It requires some fiddling to make the device code understandable by the host compiler. In particular I have to switch the calling convention between .PtxKernel and .Unspecified when compiling for device vs. host, and prevent Zig from looking at the inline device assembly. There may also be consistency issues with usize: what if I want to drive a 32-bit device from a 64-bit host, or vice versa?
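The calling-convention fiddling from quirk 2 can be sketched as a comptime switch on the current compilation target (a minimal illustration, assuming the .PtxKernel convention from the fork described above; the guard on the body is likewise illustrative):

```zig
const std = @import("std");
const builtin = @import("builtin");

// Pick the kernel calling convention based on which target this file is
// currently being compiled for: the device build gets .PtxKernel, while
// the host build (which only needs the signature for reflection) gets a
// normal calling convention.
const kernel_cc: std.builtin.CallingConvention =
    if (builtin.cpu.arch == .nvptx64) .PtxKernel else .Unspecified;

pub export fn addOne(data: [*]u32, len: u32) callconv(kernel_cc) void {
    // Keep device-only code (intrinsics, inline assembly) out of the
    // host compiler's sight.
    if (builtin.cpu.arch != .nvptx64) return;
    var i: u32 = 0;
    while (i < len) : (i += 1) data[i] += 1;
}
```

The same source file can then be imported by both the device and the host compilation, which is exactly what makes the reflection-based wrapper generation possible.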


gwenzek commented Dec 24, 2021

Notes from yesterday's Stage2 meeting (not exact quotes, written from memory):

@andrewrk

We want to experiment with "ergonomic heterogeneous computing". If exposing the ABI in the generated .ptx ends up being a bad idea, we will reconsider. But we need a first version so that we can get feedback.

Andrew also noted that the problem of writing code that can be compiled for both targets (quirk 2 from the previous list) can be solved by making the compiler lazier, not analyzing the body of a function when it is only used for metaprogramming. He also suggested client-side mitigations.

@Snektron

We don't want to hide the device binary from the user, because that leads to implicit initialization of the devices, whereas Zig encourages explicitness.

@andrewrk andrewrk added this to the 0.11.0 milestone Dec 27, 2021
@andrewrk andrewrk added the use case Describes a real use case that is difficult or impossible, but does not propose a solution. label Dec 27, 2021

eira-fransham commented Jan 30, 2022

Could Zig, as a primitive, expose a userland version of the build system at comptime that can work at the level of a single function? Something like @compile(options: Builder, function: anytype). That way this functionality could live in a library while still interacting with the same caching systems, etc. The main issue I see is that Zig deliberately doesn't support I/O at comptime, so you couldn't use the results of that compilation in further comptime operations (since they would be written to disk). On the one hand that's preferable, because it prevents const evaluation from needing to wait on the entire build process; on the other hand it could rule out some legitimate use cases.

EDIT: Actually, since a version of @compile that simply hooks into the build system could only have its results read via side effects, this wouldn't work precisely as described. The user would need to access the compiled artefacts via a return value; otherwise it would be impossible for the Zig compiler to know whether it needs to evaluate the @compile statements at all.
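A purely hypothetical sketch of what the return-value version might look like in use (@compile does not exist; every name and option here is illustrative only):

```zig
const kernels = @import("hw5_kernel.zig");

// Hypothetical: compile a single function for a foreign target from
// comptime, getting the artifact back as a value so the compiler can
// track the dependency instead of relying on side effects.
const ptx: []const u8 = @compile(.{
    .target = .{ .cpu_arch = .nvptx64, .os_tag = .cuda },
    .optimize = .ReleaseFast,
}, kernels.addOne);

// The host build could then embed `ptx` directly and push it to the
// device at runtime, replacing the build-options plumbing above.
```

Returning the artifact as a comptime value is what lets the compiler decide whether a given @compile call is actually needed.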
