Pointers for NVPTX support #10064

Closed · gwenzek opened this issue Oct 30, 2021 · 11 comments

gwenzek (Contributor) commented Oct 30, 2021

The NVPTX backend is currently Tier 4 in Zig,
yet there might still be people interested in improving support for it (like myself).
Here I'm sharing a few pointers provided by @Snektron to get started, complemented by information I've gleaned from the LLVM and Nvidia documentation.

I've also started a branch with an ongoing implementation.
I'm learning about the Zig compiler and LLVM at the same time, so beware! gwenzek#1

Address spaces

The PTX format defines a virtual machine and an ISA for the GPU. The VM makes use of several distinct memory spaces.
Address space support has only been added to Stage 2, so you'll need to use Stage 2.
Only a few generic address spaces have been defined so far, but more should be added for Nvidia GPUs:

zig/lib/std/builtin.zig

Lines 172 to 177 in 70ef9bc

pub const AddressSpace = enum {
    generic,
    gs,
    fs,
    ss,
};
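
A minimal sketch of how the enum might grow, assuming we simply mirror the nvptx names from the LLVM bindings quoted further down (the exact set and names are up for discussion):

pub const AddressSpace = enum {
    generic,
    gs,
    fs,
    ss,
    // Hypothetical nvptx additions, mirroring llvm.address_space.nvptx:
    global,
    constant,
    shared,
    param,
    local,
};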

And defaults here:

zig/src/target.zig

Lines 573 to 577 in 70ef9bc

) std.builtin.AddressSpace {
    _ = target;
    _ = context;
    return .generic;
}

(For nvptx you probably want the .constant address space for constants, .local for locals, etc.)
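
A hedged sketch of what the nvptx defaults could look like; the context variant names (.constant, .local) are illustrative here, not the actual target.zig API:

) std.builtin.AddressSpace {
    switch (target.cpu.arch) {
        .nvptx, .nvptx64 => return switch (context) {
            .constant => .constant, // hypothetical context variants
            .local => .local,
            else => .generic,
        },
        else => return .generic,
    }
}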
Then convert the Zig address spaces to LLVM address spaces here:

fn llvmAddressSpace(self: DeclGen, address_space: std.builtin.AddressSpace) c_uint {

The Zig LLVM bindings already define the Nvidia memory spaces:

pub const nvptx = struct {
    pub const generic: c_uint = 0;
    pub const global: c_uint = 1;
    pub const constant: c_uint = 2;
    pub const shared: c_uint = 3;
    pub const param: c_uint = 4;
    pub const local: c_uint = 5;
};
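
With that, the conversion for nvptx could be a straightforward switch; a minimal sketch, assuming the AddressSpace variants sketched above:

fn nvptxLlvmAddressSpace(address_space: std.builtin.AddressSpace) c_uint {
    return switch (address_space) {
        .generic => llvm.address_space.nvptx.generic, // 0
        .global => llvm.address_space.nvptx.global, // 1
        .constant => llvm.address_space.nvptx.constant, // 2
        .shared => llvm.address_space.nvptx.shared, // 3
        .param => llvm.address_space.nvptx.param, // 4
        .local => llvm.address_space.nvptx.local, // 5
        else => unreachable, // gs/fs/ss make no sense on nvptx
    };
}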

To better understand what Zig needs to do, you can look at a sample kernel written in LLVM IR:
https://llvm.org/docs/NVPTXUsage.html#the-kernel
I guess it would also be helpful to be able to generate LLVM IR from an arbitrary CUDA kernel;
apparently Clang can do so: https://www.llvm.org/docs/CompileCudaWithLLVM.html
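
For example, something along these lines should dump the device-side LLVM IR for a CUDA source file (flags from the Clang CUDA docs; the GPU arch is just an example):

clang++ --cuda-device-only --cuda-gpu-arch=sm_50 -S -emit-llvm kernel.cu -o kernel.ll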

One would also need to implement support for the CUDA special variables gridDim, blockIdx, ..., which are stored in special registers:
https://llvm.org/docs/NVPTXUsage.html#id7
as well as the intrinsic for the block barrier, __syncthreads.

I'm not sure how to do that yet.

Snektron (Collaborator) commented Oct 30, 2021

Intrinsics can be added via the CreateBinaryIntrinsic, CreateUnaryIntrinsic, and CreateIntrinsic functions provided by LLVM. See, for example, how the Intrinsic::umax intrinsic is emitted in Zig:

zig/src/zig_llvm.cpp

Lines 472 to 475 in 969bcb6

LLVMValueRef ZigLLVMBuildUMax(LLVMBuilderRef B, LLVMValueRef LHS, LLVMValueRef RHS, const char *name) {
    CallInst *call_inst = unwrap(B)->CreateBinaryIntrinsic(Intrinsic::umax, unwrap(LHS), unwrap(RHS), nullptr, name);
    return wrap(call_inst);
}

According to the LLVM NVPTX page, __syncthreads is @llvm.nvvm.barrier0(). After some digging (opening build/include/llvm/IR/IntrinsicEnums.inc, where build is the LLVM build directory), we can see that this translates to Intrinsic::nvvm_barrier0. Note that there is no C API call for CreateIntrinsic, so a shim would need to be added to zig_llvm.cpp in order to use it, similar to the other Intrinsic usages.

daurnimator (Contributor) commented

@gwenzek FYI, on GitHub you can press y to pin the URL to a specific commit; that way your links with line numbers won't move around.

gwenzek (Contributor, Author) commented Nov 2, 2021

I've started a prototype on gwenzek#1
I've added some glue code so that I can run:

.../stage2/bin/zig build-obj kernel.zig -target nvptx64-cuda -O ReleaseSafe -femit-llvm-ir

This generates an invalid kernel.ptx which doesn't contain PTX code. I need to dig into the LLVM documentation to understand how to make it generate PTX.
I think there is an issue with how the LLVM TargetMachine is set up that prevents it from generating the output.
I hacked it to make it generate bitcode instead, but that's not PTX.

This will generate a good looking .ptx

The main thing that surprised me is that I had to modify codegen.zig even though I don't want to implement PTX generation
in Stage 2 and only want to use LLVM for that. Is there a way to have an architecture that is only supported by LLVM and not by Zig itself?

I've also created a new output format, ".ptx", for the linker, as is done for the SPIR-V or asm architectures.

Snektron (Collaborator) commented Nov 2, 2021

The path you're going down now with gwenzek#1 is implementing your own nvptx code generation in Stage 2. If you don't want to do that, you shouldn't need to create an entire custom link format or enable nvptx in codegen.zig. In fact, can you try just running zig build-obj -target nvptx-cuda -fLLVM -fno-stage1 kernel.zig with an upstream build of Stage 2? I think most of the LLVM target definitions are already present, and the main thing to figure out is whether we need to get LLVM to emit bitcode instead of ELF files.

gwenzek (Contributor, Author) commented Nov 2, 2021

@Snektron I just retried without the change in codegen.zig, and it works; I must have used the wrong command at the beginning.
The main thing I need to change to output PTX is to ask LLVM to output assembly instead of an object file.
For that I need to modify flushModule to ask for assembly.
Also, I don't want to link the ".ptx" file I obtain, so if I'm not mistaken I need to create a link/NvPtx.zig to disable that.

I think a refactoring could make this cleaner, by having Compilation.zig check the architecture and enable/disable some options there. It could also be a good place to add the architecture-specific validation that you put at:

zig/src/link/SpirV.zig

Lines 73 to 76 in 310f3df

switch (options.target.cpu.arch) {
    .spirv32, .spirv64 => {},
    else => return error.TODOArchNotSupported,
}
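
The analogous check in a hypothetical link/NvPtx.zig would presumably be (a sketch, mirroring the SpirV.zig code above):

switch (options.target.cpu.arch) {
    .nvptx, .nvptx64 => {},
    else => return error.TODOArchNotSupported,
}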

By the way, upstream Stage 2 yields:

> stage2/bin/zig build-obj cudaz/kernel.zig -target nvptx64-cuda -O ReleaseSafe 
error(codegen): LLVM failed to emit asm=(none) bin=kernel.o.o ir=(none) bc=(none): TargetMachine can't emit an object file
error: FailedToEmit

gwenzek (Contributor, Author) commented Nov 4, 2021

I've started looking into how to add support for the PTX intrinsics mentioned above. The default way seems to be adding @ptxSyncThreads, @ptxThreadId, ... builtins to Zig through the various stages of the compiler (like @wasmMemoryGrow is implemented).

Other approaches have been discussed in #7702 and #4466.
But while looking through the issue backlog I also found #2291, which says that the @"identifier" syntax can be abused to generate LLVM intrinsics.
I tried it, and the following snippet compiles with the nvptx backend:

export fn hello(out: [*c]u8) void {
    out.* = 72 + @intCast(u8, threadIdX());
    __syncthreads();
}

extern fn @"llvm.nvvm.barrier0"() void;
inline fn __syncthreads() void {
    @"llvm.nvvm.barrier0"();
}

But for some reason LLVM, when called from Zig, crashes when doing something similar to read the special registers:

extern fn @"llvm.nvvm.read.ptx.sreg.tid.x"() i32;
inline fn threadIdX() i32 {
    return @"llvm.nvvm.read.ptx.sreg.tid.x"();
}

export fn hello(out: [*c]u8) void {
    out.* = 72 + @intCast(u8, threadIdX());
}

The error message:

~/github/zig/stage2/bin/zig build-obj cudaz/kernel.zig -target nvptx64-cuda -O ReleaseSafe --verbose-llvm-ir 
info(link): Creatinsg .ptx target kernel.ptx.ptx
info(link): Updating function: 3:3
info(link): Updating function: 41:46
info(link): Flush kernel.ptx.ptx
; ModuleID = 'kernel'
source_filename = "kernel"
target datalayout = "e-i64:64-i128:128-v16:16-v32:32-n16:32:64"

; Function Attrs: noredzone nounwind
define dso_local void @entry() #0 {
Entry:
  ret void
}

; Function Attrs: noredzone nounwind
define dso_local void @hello(i8* %0) #0 {
Entry:
  %1 = alloca i32, align 4
  %2 = alloca i8*, align 8
  store i8* %0, i8** %2, align 8
  %3 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  store i32 %3, i32* %1, align 4
  %4 = load i32, i32* %1, align 4
  %5 = trunc i32 %4 to i8
  %6 = add nuw i8 72, %5
  store i8 %6, i8* %0, align 1
  ret void
}

; Function Attrs: noredzone nounwind
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() #0

attributes #0 = { noredzone nounwind "frame-pointer"="none" }
info(codegen): emit LLVM object to kernel.ptx.ptx
LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.read.ptx.sreg.tid.x
zsh: IOT instruction (core dumped)  ~/github/zig/stage2/bin/zig build-obj cudaz/kernel.zig -target  -O  
134

Interestingly, pasting the debug output into a .ll file and calling llc directly on it works, and generates the .ptx file below.
So there must be something special in how we are calling LLVM from Zig. I've tried changing some options, but without success.
I'm not familiar enough with LLVM to understand what could lead to a "Cannot select" error.

~/local/llvm13-release/bin/llc -mcpu=sm_20 --mtriple=nvptx64-nvidia-cuda kernel.ll -o kernel.ll.ptx

	// .globl	hello                   // -- Begin function hello
.visible .func hello(
	.param .b64 hello_param_0
)                                       // @hello
{
	.reg .b16 	%rs<3>;
	.reg .b32 	%r<2>;
	.reg .b64 	%rd<2>;

// %bb.0:                               // %Entry
	ld.param.u64 	%rd1, [hello_param_0];
	mov.u32 	%r1, %tid.x;
	cvt.u16.u32 	%rs1, %r1;
	add.s16 	%rs2, %rs1, 72;
	st.u8 	[%rd1], %rs2;
	ret;
                                        // -- End function
}

Anyway, given that the "assembly" corresponding to reading from a register is pretty simple, I also tried to
generate it directly using the asm keyword, which seems to have seen some recent improvements in Stage 2.

So to generate mov.u32 %r1, %tid.x; in PTX, it seems I need
call i32 asm sideeffect "mov.u32 \09$0, %tid.x;", "=r"() in LLVM IR, which can be generated by:

asm volatile ("mov.u32 \t$0, %tid.x;"
        : [ret] "=r" (-> i32)
    );
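
Putting the pieces together, the crashing intrinsic can be replaced by inline assembly; a sketch of the same hello kernel as above:

inline fn threadIdX() i32 {
    // mov.u32 reads the %tid.x special register into a general register.
    return asm volatile ("mov.u32 \t$0, %tid.x;"
        : [ret] "=r" (-> i32)
    );
}

export fn hello(out: [*c]u8) void {
    out.* = 72 + @intCast(u8, threadIdX());
}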

So stay tuned, because I feel I'm making progress ^^
I still need to test the generated PTX, but I feel I'll be able to make progress without having to poke too much into Stage 2 internals (which I appreciate).

andrewrk added and then removed the frontend (Tokenization, parsing, AstGen, Sema, and Liveness) label Nov 20, 2021
andrewrk added this to the 0.10.0 milestone Nov 20, 2021
gwenzek (Contributor, Author) commented Nov 20, 2021

Since my presentation at the last Zig Rush, I've removed the worst hacks
and made PR #10189.

gwenzek (Contributor, Author) commented Sep 16, 2022

I've opened #12878 to update the backend to 0.10. I feel we are in a good place now, so I'll close this issue.

Thanks to the progress of the self-hosted compiler, we can now generate debug information in the PTX files. The assembly syntax also now works as documented, so it's simpler to use the special PTX registers. And I can use the same zig binary for building the device and the host code, which is pretty exciting.

See https://github.com/gwenzek/cudaz/tree/e8895596009c689300fe7c7193fa2dbf7db07629 for user code using this Zig branch.

gwenzek closed this as completed Sep 16, 2022
andrewrk modified the milestones: 0.12.0, 0.10.0 Sep 30, 2022
classner commented

@gwenzek Is it possible to experiment with this on the latest Zig (since it looks like all the PRs have been merged)? I'm just curious: since Andrew added it to the 0.12.0 milestone, is it not on main yet?

nektro (Contributor) commented May 12, 2023

It's in the 0.10 milestone.

Snektron (Collaborator) commented

Support for the PTX backend is kind of a work in progress, but it should already work for some programs. You can check out this repository for some pointers on how to get started, though I think the Zig parts are a little out of date. In general it should mostly work.
