Tracking issue for the "ptx-kernel" ABI #38788

japaric · 2017-01-02T23:49:48Z

Feature gate #![feature(abi_ptx)]

This ABI is intended to be used when generating code for device (GPU) targets like nvptx64-nvidia-cuda. It is used to generate kernels ("global functions") that work as an entry point from host (cpu) code. Functions that do not use the "ptx-kernel" ABI are "device functions" and only callable from kernels and device functions. Device functions are specifically not usable from host (cpu) code.

Public API

The following code

#![no_std]
#![feature(abi_ptx)]

#[no_mangle]
pub extern "ptx-kernel" fn foo() {}

Produces

.version 3.2
.target sm_30
.address_size 64

	// .globl	foo

.visible .entry foo()
{
	ret;
}

Steps / History

Fix broken passing of kernel arguments (Fix codegen bug in "ptx-kernel" abi related to arg passing #94703)
Replace PassMode::Direct with something else (Add infrastructure to "compute the ABI of a Rust type, described as a C type" compiler-team#672)
Re-enable ptx CI tests to avoid future breakage (Re-enable nvptx tests #96842)
Emit error for kernels with return value other than ()
Fix the problem where Rust generates types the LLVM PTX cannot select (NVPTX: "LLVM ERROR: Cannot select" when returning struct with 3byte size from "device function" #97174)
Resolve unresolved questions
Create an RFC that specifies the safe way to use this abi (I assume this will be required @pnkfelix?)
Document feature (https://doc.rust-lang.org/reference/items/external-blocks.html#abi)
Stabilization PR

Unresolved Questions

Resolve what kind of stability guarantees can be made about the generated ptx.
- The ABI of kernels have been previously changed for a major version bump and the ptx-interoperability doc is still outdated.
- PTX is an ISA with many versions. The newest is major version 7. Do we need to reserve the possibility of breaking things when moving to a new major version?
- Figure out what llvm does in relations to the nvptx64-nvidia-cuda target and the __global__ modifier.
What kind of types should be allowed to use as arguments in kernels. Should it be a hard error to use these types or only a warning (Global and device kernels are unsound rust-cuda/wg#11)
- The most important part is to find a minimal but useful subset of Rust types that can be used in kernels. raw pointers, primitive types and #[repr(C)] types seems like a good start (no slices, tuples, references, etc).
- Using mutable references is almost certain UB except for a few unusable special cases (spawning a single thread only)
- There are many convenient types in Rust which do not have a stable ABI (&[T], (T, U), etc). Are there some types that do not have a stable representation but can be relied on having an identical representation for sequential compilation with a given rustc version? If so are there any way we could pass them safely between host and device code compiled with the same rustc version?
This unstable feature is one of the last stoppers to using nvptx64-nvidia-cuda on stable Rust. The target seems to still have a few bugs (NVPTX backend metabug #38789). Should this feature be kept unstable to avoid usage of nvptx64-nvidia-cuda until it has been verified to be usable.
How should shared be supported? Is it necessary to do that from the go?
nvptx "ptx-kernel" ABI (feature: abi_ptx) uses PassMode::Direct for Aggregates #117271

Notes

It is not possible to emulate kernels with #[naked] functions as the .entry directive needs to be emitted for nvptx kernels, which requires this ABI.

The text was updated successfully, but these errors were encountered:

…lexcrichton Ensure every unstable language feature has a tracking issue. Filled in the missing numbers: * `abi_ptx` → #38788 * `generators` → #43122 * `global_allocator` → #27389 Reused existing tracking issues because they were decomposed from a larger feature * `*_target_feature` → #44839 (reusing the old `target_feature` number) * `proc_macros_*` → #38356 (reusing the to-be-stabilized `proc_macros` number) Filed new issues * `exhaustive_patterns` → #51085 * `pattern_parentheses` → #51087 * `wasm_custom_section` and `wasm_import_module` → #51088

steveklabnik · 2019-03-12T17:50:18Z

Triage: not aware of any movement on stabilizing this.

lahwran · 2020-03-07T06:50:37Z

would be very interested in this! any of y'all know what needs doing to bring this back to life?

edit: following links through the metabug, and then through a project that mentioned the metabug, looks like there are in fact folks still interested. cool!

kjetilkjeka · 2022-02-23T17:00:13Z

I'm curious if I might have found a codegen bug related to this feature. It is related to how the struct is assumed to be passed to the kernel.

When compiling the following Rust code

#![no_std]
#![feature(abi_ptx, stdsimd)]

#[panic_handler]
fn panic(info: &core::panic::PanicInfo) -> ! {
    unreachable!()
}

#[repr(C)]
pub struct Foo{
    array: [f32; 9],
}

#[no_mangle]
pub unsafe extern "ptx-kernel" fn add(a: *mut f32, b: Foo) {
    *a = b.array[5];
}

Rustc will produce the following

When compiling with rustc +nightly --target nvptx64-nvidia-cuda <filename> -C target-cpu=sm_52 --crate-type cdylib

//
// Generated by LLVM NVPTX Back-End
//

.version 4.1
.target sm_52
.address_size 64

	// .globl	add

.visible .entry add(
	.param .u64 add_param_0,
	.param .u64 add_param_1
)
{
	.reg .f32 	%f<2>;
	.reg .b64 	%rd<5>;

	ld.param.u64 	%rd1, [add_param_0];
	ld.param.u64 	%rd2, [add_param_1];
	cvta.to.global.u64 	%rd3, %rd2;
	cvta.to.global.u64 	%rd4, %rd1;
	ld.global.f32 	%f1, [%rd3+32];
	st.global.f32 	[%rd4], %f1;
	ret;

}

On the other hand, when compiling the following C++ code

struct foo {
    float array[9];
};

extern "C" __global__ void add( float *a, struct foo b) {
    *a = b.array[5];
}

I get the following output for nvcc

When compiling with nvcc -ptx <filename>

//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-30794723
// Cuda compilation tools, release 11.6, V11.6.55
// Based on NVVM 7.0.1
//

.version 7.6
.target sm_52
.address_size 64

	// .globl	add

.visible .entry add(
	.param .u64 add_param_0,
	.param .align 4 .b8 add_param_1[36]
)
{
	.reg .f32 	%f<2>;
	.reg .b64 	%rd<3>;


	ld.param.u64 	%rd1, [add_param_0];
	cvta.to.global.u64 	%rd2, %rd1;
	ld.param.f32 	%f1, [add_param_1+20];
	st.global.f32 	[%rd2], %f1;
	ret;

}

I get the following output for clang

When compiling with clang++ --cuda-device-only -nocudalib --cuda-gpu-arch=sm_52 <filename> -S -o <outfile>

//
// Generated by LLVM NVPTX Back-End
//

.version 4.1
.target sm_52
.address_size 64

	// .globl	add

.visible .entry add(
	.param .u64 add_param_0,
	.param .align 4 .b8 add_param_1[36]
)
{
	.local .align 8 .b8 	__local_depot0[48];
	.reg .b64 	%SP;
	.reg .b64 	%SPL;
	.reg .f32 	%f<11>;
	.reg .b64 	%rd<8>;

	mov.u64 	%SPL, __local_depot0;
	cvta.local.u64 	%SP, %SPL;
	mov.b64 	%rd2, add_param_1;
	ld.param.u64 	%rd1, [add_param_0];
	ld.param.f32 	%f1, [add_param_1];
	ld.param.f32 	%f2, [add_param_1+8];
	ld.param.f32 	%f3, [add_param_1+12];
	ld.param.f32 	%f4, [add_param_1+16];
	ld.param.f32 	%f5, [add_param_1+20];
	ld.param.f32 	%f6, [add_param_1+24];
	ld.param.f32 	%f7, [add_param_1+28];
	ld.param.f32 	%f8, [add_param_1+32];
	ld.param.f32 	%f9, [add_param_1+4];
	add.u64 	%rd3, %SP, 0;
	or.b64  	%rd4, %rd3, 4;
	st.f32 	[%rd4], %f9;
	st.f32 	[%SP+32], %f8;
	st.f32 	[%SP+28], %f7;
	st.f32 	[%SP+24], %f6;
	st.f32 	[%SP+20], %f5;
	st.f32 	[%SP+16], %f4;
	st.f32 	[%SP+12], %f3;
	st.f32 	[%SP+8], %f2;
	st.f32 	[%SP+0], %f1;
	cvta.to.global.u64 	%rd5, %rd1;
	cvta.global.u64 	%rd6, %rd5;
	st.u64 	[%SP+40], %rd6;
	ld.f32 	%f10, [%SP+20];
	ld.u64 	%rd7, [%SP+40];
	st.f32 	[%rd7], %f10;
	ret;

}

Here's the different compiler versions I'm using

$ rustc +nightly --version
rustc 1.61.0-nightly (68369a041 2022-02-22)
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Fri_Dec_17_18:16:03_PST_2021
Cuda compilation tools, release 11.6, V11.6.55
Build cuda_11.6.r11.6/compiler.30794723_0
$ clang++ --version
clang version 10.0.0-4ubuntu1 
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

I would be happy to try to figure out whats going on if anyone can confirm that this is a bug? If anyone would be able to point me in the right direction that would be a nice bonus as well.

pnkfelix · 2022-03-17T18:53:34Z

Visiting as part of a T-compiler backlog bonanza prepass.

This at very least is in a "needs summary" state. But based on @RDambrosio016 's comment from 10 days ago, there may even be design concerns? (At least depending on what scope of potential parameter types we want this ABI to cover at the outset.)

@rustbot label: S-tracking-design-concerns S-tracking-needs-summary

kjetilkjeka · 2022-03-17T19:05:01Z

@pnkfelix I'm currently working on fixing bugs in ptx-kernel in #94703 and would like to help out providing a summary. I have also discovered some design concerns that I can contribute to the summary.

Do you have a template or example of how it should look like?

kjetilkjeka · 2022-03-25T13:20:42Z

Here's a suggestion for an update to the tracking issue to include concerns. Partially copied for japaric's original post and added concerns from and links to relevant issues.

If you have the possibility you should take a look @RDambrosio016

Feature gate #![feature(abi_ptx)]

This ABI is intended to be used when generating code for device (GPU) targets like nvptx64-nvidia-cuda. It is used to generate kernels ("global functions") that work as an entry point from host (cpu) code. Functions that do not use the "ptx-kernel" ABI are "device functions" and only callable from kernels and device functions. Device functions are specifically not usable from host (cpu) code.

Public API

The following code

#![no_std]
#![feature(abi_ptx)]

#[no_mangle]
pub extern "ptx-kernel" fn foo() {}

Produces

.version 3.2
.target sm_30
.address_size 64

	// .globl	foo

.visible .entry foo()
{
	ret;
}

Steps / History

Unresolved Questions

Resolve what kind of stability guarantees can be made about the generated ptx.
- The ABI of kernels have been previously changed for a major version bump and the ptx-interoperability doc is still outdated.
- PTX is an ISA with many versions. The newest is major version 7. Do we need to reserve the possibility of breaking things when moving to a new major version?
- Figure out what llvm does in relations to the nvptx64-nvidia-cuda target and the __global__ modifier.
What kind of types should be allowed to use as arguments in kernels. Should it be a hard error to use these types or only a warning (Global and device kernels are unsound rust-cuda/wg#11)
- The most important part is to find a minimal but useful subset of Rust types that can be used in kernels. raw pointers, primitive types and #[repr(C)] types seems like a good start (no slices, tuples, references, etc).
- Using mutable references is almost certain UB except for a few unusable special cases (spawning a single thread only)
- There are many convenient types in Rust which do not have a stable ABI (&[T], (T, U), etc). Are there some types that do not have a stable representation but can be relied on having an identical representation for sequential compilation with a given rustc version? If so are there any way we could pass them safely between host and device code compiled with the same rustc version?
This unstable feature is one of the last stoppers to using nvptx64-nvidia-cuda on stable Rust. The target seems to still have a few bugs (NVPTX backend metabug #38789). Should this feature be kept unstable to avoid usage of nvptx64-nvidia-cuda until it has been verified to be usable.
How should shared be supported? Is it necessary to do that from the go?

Notes

It is not possible to emulate kernels with #[naked] functions as the .entry directive needs to be emited for nvptx kernels.

… r=nagisa Fix codegen bug in "ptx-kernel" abi related to arg passing I found a codegen bug in the nvptx abi related to that args are passed as ptrs ([see comment](rust-lang#38788 (comment))), this is not as specified in the [ptx-interoperability doc](https://docs.nvidia.com/cuda/ptx-writers-guide-to-interoperability/) or how C/C++ does it. It will also almost always fail in practice since device/host uses different memory spaces for most hardware. This PR fixes the bug and add tests for passing structs to ptx kernels. I observed that all nvptx assembly tests had been marked as [ignore a long time ago](rust-lang#59752 (comment)). I'm not sure if the new one should be marked as ignore, it passed on my computer but it might fail if ptx-linker is missing on the server? I guess this is outside scope for this PR and should be looked at in a different issue/PR. I only fixed the nvptx64-nvidia-cuda target and not the potential code paths for the non-existing 32bit target. Even though 32bit nvptx is not a supported target there are still some code under the hood supporting codegen for 32 bit ptx. I was advised to create an MCP to find out if this code should be removed or updated. Perhaps `@RDambrosio016` would have interest in taking a quick look at this.

… r=nagisa Fix codegen bug in "ptx-kernel" abi related to arg passing I found a codegen bug in the nvptx abi related to that args are passed as ptrs ([see comment](rust-lang#38788 (comment))), this is not as specified in the [ptx-interoperability doc](https://docs.nvidia.com/cuda/ptx-writers-guide-to-interoperability/) or how C/C++ does it. It will also almost always fail in practice since device/host uses different memory spaces for most hardware. This PR fixes the bug and add tests for passing structs to ptx kernels. I observed that all nvptx assembly tests had been marked as [ignore a long time ago](rust-lang#59752 (comment)). I'm not sure if the new one should be marked as ignore, it passed on my computer but it might fail if ptx-linker is missing on the server? I guess this is outside scope for this PR and should be looked at in a different issue/PR. I only fixed the nvptx64-nvidia-cuda target and not the potential code paths for the non-existing 32bit target. Even though 32bit nvptx is not a supported target there are still some code under the hood supporting codegen for 32 bit ptx. I was advised to create an MCP to find out if this code should be removed or updated. Perhaps ``@RDambrosio016`` would have interest in taking a quick look at this.

…exception, r=workingjubilee,RalfJung NVPTX: Allow PassMode::Direct for ptx kernels for now Upgrading the nvptx toolchain to the newest nightly makes it hit the assert that links to rust-lang#115666 It seems like most targets get around this by using `PassMode::Indirect`. That is impossible for the kernel as it's not a normal call, but instead the arguments are copied from CPU to GPU and the passed pointer would be invalid when it reached the GPU. I also made an experiment with `PassMode::Cast` but at least the most simple version of this broke the assembly API tests. I added fixing the pass mode in my unofficial tracking issue list (I do not have the necessary permissions to update to official one). rust-lang#38788 (comment) Since the ptx_abi is currently unstable and have been working with `PassMode::Direct` for more than a year now, the steps above is hopefully sufficient to enable it as an exception until I can prioritize to fix it. I'm currently looking at steps to enable the CI for nvptx64 again and would prefer to finish that first.

Rollup merge of rust-lang#117247 - kjetilkjeka:nvptx_direct_passmode_exception, r=workingjubilee,RalfJung NVPTX: Allow PassMode::Direct for ptx kernels for now Upgrading the nvptx toolchain to the newest nightly makes it hit the assert that links to rust-lang#115666 It seems like most targets get around this by using `PassMode::Indirect`. That is impossible for the kernel as it's not a normal call, but instead the arguments are copied from CPU to GPU and the passed pointer would be invalid when it reached the GPU. I also made an experiment with `PassMode::Cast` but at least the most simple version of this broke the assembly API tests. I added fixing the pass mode in my unofficial tracking issue list (I do not have the necessary permissions to update to official one). rust-lang#38788 (comment) Since the ptx_abi is currently unstable and have been working with `PassMode::Direct` for more than a year now, the steps above is hopefully sufficient to enable it as an exception until I can prioritize to fix it. I'm currently looking at steps to enable the CI for nvptx64 again and would prefer to finish that first.

…, r=workingjubilee,RalfJung NVPTX: Allow PassMode::Direct for ptx kernels for now Upgrading the nvptx toolchain to the newest nightly makes it hit the assert that links to rust-lang/rust#115666 It seems like most targets get around this by using `PassMode::Indirect`. That is impossible for the kernel as it's not a normal call, but instead the arguments are copied from CPU to GPU and the passed pointer would be invalid when it reached the GPU. I also made an experiment with `PassMode::Cast` but at least the most simple version of this broke the assembly API tests. I added fixing the pass mode in my unofficial tracking issue list (I do not have the necessary permissions to update to official one). rust-lang/rust#38788 (comment) Since the ptx_abi is currently unstable and have been working with `PassMode::Direct` for more than a year now, the steps above is hopefully sufficient to enable it as an exception until I can prioritize to fix it. I'm currently looking at steps to enable the CI for nvptx64 again and would prefer to finish that first.

japaric added O-NVPTX Target: the NVPTX LLVM backend for running rust on GPUs, https://llvm.org/docs/NVPTXUsage.html B-unstable Feature: Implemented in the nightly compiler and unstable. labels Jan 2, 2017

japaric mentioned this issue Jan 2, 2017

NVPTX backend metabug #38789

Open

sfackler added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Jan 3, 2017

Mark-Simulacrum added the C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. label Jul 22, 2017

kennytm mentioned this issue May 26, 2018

Ensure every unstable language feature has a tracking issue. #51090

Merged

kjetilkjeka mentioned this issue Mar 7, 2022

Fix codegen bug in "ptx-kernel" abi related to arg passing #94703

Merged

rustbot added S-tracking-design-concerns Status: There are blocking ❌ design concerns. S-tracking-needs-summary Status: It's hard to tell what's been done and what hasn't! Someone should do some investigation. labels Mar 17, 2022

ViliamVadocz mentioned this issue May 4, 2023

Possible to use #![feature(abi_ptx)]? coreylowman/cudarc#143

Open

kjetilkjeka mentioned this issue Oct 26, 2023

NVPTX: Allow PassMode::Direct for ptx kernels for now #117247

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tracking issue for the "ptx-kernel" ABI #38788

Tracking issue for the "ptx-kernel" ABI #38788

japaric commented Jan 2, 2017 •

edited by RalfJung

steveklabnik commented Mar 12, 2019

lahwran commented Mar 7, 2020 •

edited

kjetilkjeka commented Feb 23, 2022 •

edited

pnkfelix commented Mar 17, 2022

kjetilkjeka commented Mar 17, 2022

kjetilkjeka commented Mar 25, 2022 •

edited

Tracking issue for the "ptx-kernel" ABI #38788

Tracking issue for the "ptx-kernel" ABI #38788

Comments

japaric commented Jan 2, 2017 • edited by RalfJung

Public API

Steps / History

Unresolved Questions

Notes

steveklabnik commented Mar 12, 2019

lahwran commented Mar 7, 2020 • edited

kjetilkjeka commented Feb 23, 2022 • edited

pnkfelix commented Mar 17, 2022

kjetilkjeka commented Mar 17, 2022

kjetilkjeka commented Mar 25, 2022 • edited

Public API

Steps / History

Unresolved Questions

Notes

japaric commented Jan 2, 2017 •

edited by RalfJung

lahwran commented Mar 7, 2020 •

edited

kjetilkjeka commented Feb 23, 2022 •

edited

kjetilkjeka commented Mar 25, 2022 •

edited