NVPTX backend metabug #38789
Comments
japaric added the metabug and O-NVPTX labels on Jan 2, 2017
Mark-Simulacrum added the C-tracking-issue label on Jul 26, 2017
---
It would be better if, instead of having to create a new module / crate, adding a `#[target_feature]` attribute to a function were enough. That way we could just write our kernels inline with normal Rust code:

```rust
#[target_feature(enabled = "ptx_device_fn")]
unsafe fn a_device_fn(...) { ... }

#[target_feature(enabled = "ptx_kernel")]
unsafe fn a_kernel(...) { ... a_device_fn(...) ... }

unsafe fn bar() {
    cuda::driver::launch(x, y, z, w, a_kernel);
    cuda::driver::launch(
        x, y, z, w,
        #[target_feature(enabled = "ptx_kernel")] |...| a_kernel(...),
    );
}
```

This way users can use the typical …
---
@japaric @rkruppe @alexcrichton I'd like to work on this.
---
You seem to assume a "single source" model? We don't have that currently. You have to compile your kernels as one crate for an NVPTX target, separate from the host crate. Supporting a single source model would require novel frontend integration (novel for Rust; clang has something like this already). For example, rustc would have to decide for each translation item whether it should be compiled for the host, for the device, or both -- and then combine the resulting PTX and host object files. Additionally, even if/once we have "single source", …
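For concreteness, a sketch of the two-crate status quo described above, with assumed paths and names: the kernels live in their own crate built for an NVPTX target, and the host crate merely embeds the resulting PTX as data.

```rust
// Host-crate side of the status quo: no single source, just bytes produced
// by a separate NVPTX build. The OUT_DIR file name is an assumption.
const KERNELS_PTX: &str = include_str!(concat!(env!("OUT_DIR"), "/kernels.ptx"));

fn main() {
    // A CUDA driver API wrapper would load this module at run-time
    // (cuModuleLoadData) and look kernels up by name before launching them.
    println!("embedded PTX module: {} bytes", KERNELS_PTX.len());
}
```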
---
Indeed, makes sense.
I'd like to work on enabling this via a mixture of the ABI solution (to select, e.g., the `"ptx-kernel"` ABI) and procedural macros. If we could have these multiple ABIs in a single source file, we could have something like:

```rust
#[inline]
fn baz() { }

#[target_device(host, nvptx(sm = "40"), spirv(version = "1.0"))]
fn bar(...) { ... }

#[target_device_kernel(host, nvptx(sm = "40"), spirv(version = "1.0"))]
fn foo(...) {
    #[device] bar(...); // device attribute indicates that this fn is a device fn
    baz(); // this function will be used as is
}
```

which would expand to:

```rust
fn bar_host(...) { ... }
#[target_feature(enabled = "sm40")]
extern "ptx" fn bar_nvptx(...) { ... }
extern "spirv" fn bar_spirv(...) { ... }

fn foo_host(...) { bar_host(...); baz(); }
#[target_feature(enabled = "sm40")]
extern "ptx-kernel" fn foo_nvptx(...) { bar_nvptx(...); baz(); }
extern "spirv-kernel" fn foo_spirv(...) { bar_spirv(...); baz(); }
```

And then just launch the kernels using another procedural macro.

One cool feature of clang and nvcc is that they allow whoever builds the library to easily choose the devices to target; procedural macros could allow this via feature flags. This approach leaves the door open to doing something nicer in the language in the future, while allowing libraries to experiment with better APIs. I wonder whether these two building blocks (extern ABIs in a single source file, and procedural macros) would be enough.
---
Frankly, I don't see how any of the tools we have in the language now (target_feature, proc macros, ABIs) can help at all with single source support. Right now, one crate is compiled for one target, period. Subsets of the crate can tweak some parts of the target (e.g., use non-standard ABIs or enable/disable instruction set extensions), but that's a far cry from slicing out a subset of the crate, compiling it for a completely different target, and then stitching the results back together -- and that's precisely what is necessary for single-source offloading (not just CUDA, but also everything else along these lines that I've seen). In fact, the assumption that one crate == one target runs deep in the compiler.

Even if one attempts to minimize the amount of compiler changes needed during prototyping for faster iteration (generally a good idea), e.g. by splitting the crate into two crates with an external tool and invoking rustc twice, there is ample room for compiler hacking. Even the bare minimum of single source support requires name resolution information, and being able to use generic library code will require type system integration as well. So in my opinion, this is a rather big feature with at least as much need for experimentation and compiler hacking and design work as SIMD intrinsics. I say this not to discourage you, but because your posts so far ignore the technical challenges that are, in my opinion, the biggest obstacle to single source support.

I'm also rather puzzled by the priorities here. Before experimenting with the best way to allow users to compile their single-source applications not just for multiple CUDA devices but also for entirely different targets like SPIR-V, basic features like an equivalent to `__shared__` seem like a much higher priority.
---
Oh no, I think I expressed myself wrong. I meant that once we get single source support using extern ABIs, the combination of tools that we have already available in the language can allow for pretty nice APIs.
I think that `__shared__` could just be an intrinsic.
---
What does this mean? I am not aware of any plans for any kind of single source support. And what does "using extern ABIs" mean? It seems to presuppose some strategy for single source support, but it's not clear to me which one (and it doesn't sound like any of the strategies that I am aware of). Finally, assuming I'm correct that single source support is not on the horizon, I'm puzzled why we're hashing out details of how it could be exposed better to the user if the basic technical prerequisites aren't even on the horizon.

IIUC such an intrinsic would be basically like an allocation function.
---
How so? The kernel does not allocate anything.

I'd like to work on enabling single source support, and I'd like to enable it in such a way that it is useful. In a single source model:

```rust
extern "ptx-kernel" unsafe fn foo(...) { /* ... is compiled to a PTX kernel ... */ }

fn bar(...) { /* ... is compiled for the host ... */ }
```
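For comparison, a sketch of what the device side looks like today, as its own crate. The intrinsic names here are assumptions in the style of the unstable platform-intrinsics mechanism; the exact spellings may differ:

```rust
#![no_std]
#![feature(abi_ptx, platform_intrinsics)]

extern "platform-intrinsic" {
    fn nvptx_thread_idx_x() -> i32; // CUDA's threadIdx.x (name assumed)
    fn nvptx_block_idx_x() -> i32;  // CUDA's blockIdx.x (name assumed)
    fn nvptx_block_dim_x() -> i32;  // CUDA's blockDim.x (name assumed)
}

// Kernels use the unstable "ptx-kernel" ABI (#38788).
#[no_mangle]
pub unsafe extern "ptx-kernel" fn add(x: *const f32, y: *mut f32, n: usize) {
    let i = (nvptx_block_idx_x() * nvptx_block_dim_x() + nvptx_thread_idx_x()) as usize;
    if i < n {
        *y.offset(i as isize) += *x.offset(i as isize);
    }
}
```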
---
It's a storage specifier, in C parlance. You declare variables to live in shared memory, as opposed to thread-private memory, global memory, or constant memory. In pointer types it's just an optional hint that the pointee lives in shared memory; that aspect isn't even needed. What is absolutely necessary is to be able to declare locals like `__shared__ float buf[N];`.
---
Sure, but what's the point of making it a storage specifier in Rust? You can't have two variables in shared memory; that is, the following is not valid CUDA C:

```cuda
__global__ void foo(float* foo) {
    extern __shared__ float a[];
    extern __shared__ float b[]; // ERROR: you can only have one pointer to shared memory per kernel
    foo[0] = a[0] + b[0];
}
```

In particular, here you are not allocating anything. So IMO, independently of what the spec says, we should focus on the actual semantics of `__shared__`. This:

```cuda
extern __shared__ float a[];
```

is just:

```cuda
float* a = __get_ptr_to_shared_memory();
```
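A rough Rust rendering of that reading, purely hypothetical (the intrinsic name is invented; nothing like this exists in rustc today):

```rust
#![feature(platform_intrinsics)]

extern "platform-intrinsic" {
    // Hypothetical: returns the base address of the kernel's dynamic
    // shared memory region.
    fn nvptx_shared_mem_base() -> *mut u8;
}

/// The launch must have requested at least `len * size_of::<T>()` bytes of
/// dynamic shared memory for this to be sound.
unsafe fn shared_buffer<T>(_len: usize) -> *mut T {
    nvptx_shared_mem_base() as *mut T
}
```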
---
I want to point out that we're pretty badly derailing this metabug. You should probably open a thread on internals.rlo if you want to take this discussion much further. But first I would invite you to double-check your facts. Many things you've said go against everything I've ever heard and seen about CUDA (and other offloading solutions, for that matter).

I'm not saying it should be a storage specifier in Rust. We don't even have such a concept at the moment. I'm just saying it's fundamentally a variation on variable declaration (like DST locals), not something about pointers.

???? Check the code I linked earlier, it has at least two such variables. Or pretty much any random non-trivial CUDA program using shared memory. I don't know why the code you give is rejected, but I've never even seen this syntax, so I'm not even sure what it means.

Are you saying Nvidia's examples, as well as all the other programs using shared memory, are effectively borked? That can't be right. In fact, I'm pretty sure the size of each `__shared__` declaration is recorded (provided it has a size -- again, I don't know what the unsized variant means). That reminds me: another way in which such an intrinsic for `__shared__` would fall short is statically sized shared memory declarations.
---
That syntax is the dynamic shared memory allocation syntax. The size of the shared memory region can be specified at run-time, at least in CUDA.

Reading through the docs of the example you mention: if the sizes of the variables allocated in shared memory are compile-time constants, and no dynamic shared memory allocation occurs, it looks like one does not need to specify the amount of memory to allocate during kernel launch, because the compiler does it for you (but I've always used dynamic shared memory, so I am not sure).
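For reference, the run-time size enters through the CUDA driver API's launch call. A minimal Rust FFI sketch of that entry point (handle types simplified to raw pointers; a real binding would come from a crate such as cuda-sys):

```rust
use std::os::raw::{c_int, c_uint, c_void};

type CUresult = c_int;
type CUfunction = *mut c_void;
type CUstream = *mut c_void;

extern "C" {
    fn cuLaunchKernel(
        f: CUfunction,
        grid_dim_x: c_uint,
        grid_dim_y: c_uint,
        grid_dim_z: c_uint,
        block_dim_x: c_uint,
        block_dim_y: c_uint,
        block_dim_z: c_uint,
        // Size of the dynamic shared memory region, chosen per launch.
        shared_mem_bytes: c_uint,
        h_stream: CUstream,
        kernel_params: *mut *mut c_void,
        extra: *mut *mut c_void,
    ) -> CUresult;
}
```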
---
Good to know that there's a dynamic allocation strategy as well. But it seems that you still specify the size, just at kernel invocation time? That seems like it would still allow the driver to make sure enough memory is available (i.e., as much memory as the kernel invocation specified; of course the kernel still needs to obtain and use that number correctly). For dynamic allocation an intrinsic might be good, but since static shared memory allocation seems extremely common, we'd probably want to support it as well, and an intrinsic can't really do that (well).
---
Yes.

Exactly. The typical way in which this is used is by passing something that correlates with the allocated size as a run-time argument to the kernel.

Yes, definitely. Since dynamic shared memory is more powerful (it allows doing everything that can be done with static shared memory and some more), has no drawbacks over static shared memory beyond ergonomics (shared memory is always allocated at run-time; whether the size is known at compile-time or not is pretty much irrelevant), and can probably just be an intrinsic, it seems like the place to start.

Adding support for static shared memory is more involved. AFAIK only fixed-size arrays are allowed in static shared memory and the memory must be uninitialized. So while something like …
LifeIsStrange referenced this issue on Jun 15, 2018: Tracking issue for targeting AMDGPU devices #51575 (open)
DiamondLovesYou referenced this issue on Jun 18, 2018: [Mini-RFC] Tracking issue for single source cross-compilation #51623 (open)
---
lilith commented on Jul 28, 2018:
Note that I'm interested in funding work on this: https://internals.rust-lang.org/t/nvptx-funding/7441. I'd like to get this to work out of the box on nightly.
---
Jasper-Bekkers commented on Aug 2, 2018:
Hi, I've started using the NVPTX backend for some simple experiments. I'm listing my experiences so far, since I don't know what the proper protocol is; we can turn these into specific issues on the right repos later on.
```rust
pub struct MyStruct {
    data: u32,
}

impl PartialEq for MyStruct {
    fn eq(&self, other: &Self) -> bool {
        return self.data == other.data;
    }
}
```

This leads to invalid PTX, since symbols are being generated with dots in them.
I haven't looked much into it but I have a feeling that it's due to https://github.com/rust-lang/rust/blob/master/src/librustc_codegen_utils/symbol_names.rs#L424
On the bright side: this has been a really pleasant GPU programming experience so far (other than actually getting it set up), because it's extremely valuable to share the same codebase between CPU and GPU.
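One referenced fix for this (see "Use '_' instead of '.' in symbol name sanitization" further down) boils down to rewriting the offending characters. A toy sketch of that idea, not the actual implementation:

```rust
// PTX identifiers may not contain '.', but Rust's legacy mangling emits
// dots for paths like `core..cmp..PartialEq`, so rewrite them to '_'.
fn sanitize_ptx_symbol(symbol: &str) -> String {
    symbol.replace('.', "_")
}

fn main() {
    let mangled = "_ZN8MyStruct..eq17h0000E"; // shape is illustrative only
    assert_eq!(sanitize_ptx_symbol(mangled), "_ZN8MyStruct__eq17h0000E");
}
```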
---
bheisler commented on Aug 4, 2018:
I have not been able to compile to PTX with cargo or xargo; I've only been able to do so using accel. Therefore, some of the following may be issues with accel. That seems unlikely, so I'll report them here.

Most of these are probably due to references to missing functions in the final PTX, but they'll need to be dealt with somehow. I am interested in contributing to improve the state of GPGPU in Rust; not sure where to start.
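One plausible source of such dangling references is core's panic machinery. A no_std device crate typically has to provide its own handler; a minimal sketch, assuming a toolchain with `#[panic_handler]` support:

```rust
#![no_std]

use core::panic::PanicInfo;

// Without something like this, references to core's panicking functions
// can survive into the final PTX as undefined symbols.
#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    loop {}
}
```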
---
termoshtt commented on Aug 4, 2018:
I've run into this issue while developing accel, and it prevents me from using libcore for the nvptx target; accel cannot link libcore or other std libraries currently. I recently started writing a patch for rustc to enable the nvptx target.

This seems to be good information for me :)
termoshtt referenced this issue on Aug 11, 2018: Use '_' instead of '.' in symbol name sanitization #6 (merged)
---
bheisler commented on Aug 18, 2018:
#53099 is relevant here. I haven't been able to compile any kernel recently, because of a segfault while compiling `core`.
---
@bheisler I believe it is somehow related to the target definition JSON. It doesn't happen to me with the JSON from ptx-linker, but I saw the problem with another one.
---
I'm finally proud to announce my progress on CUDA integration. I've made several tools to ease development, and I'm currently working on a tutorial and a high-level crate (it will probably be a custom rustc driver, because …).

The first one is ptx-linker, which solves several important problems. I started work on the linker about a year ago, and today reached an important milestone: it doesn't depend on any external tools and libs anymore. So end users don't need to care about matching Rust's and the system's LLVM versions (which became a problem when Rust switched to LLVM 7.0).

The second crate is ptx-builder, which improves development convenience dramatically. It's a helper meant to be used from `build.rs`.

Also worth checking: an incomplete tutorial about the CUDA development flow, with a more or less real example.
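To make that concrete: a hypothetical `build.rs` along the lines of what such a helper automates. The target name, directory layout, and use of xargo are assumptions for this sketch, not ptx-builder's actual API:

```rust
use std::process::Command;

fn main() {
    // Build the separate device crate for an NVPTX target. The target name
    // stands in for whatever custom target JSON is in use.
    let status = Command::new("xargo")
        .args(&["build", "--release", "--target", "nvptx64-nvidia-cuda"])
        .current_dir("kernel") // assumed path of the device crate
        .status()
        .expect("failed to spawn xargo");
    assert!(status.success(), "PTX build failed");

    // Rebuild the host crate whenever the device sources change.
    println!("cargo:rerun-if-changed=kernel/src");
}
```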
---
termoshtt commented on Aug 20, 2018:
@denzp Can ptx-linker link with libcore? I am creating a toolchain to link libcore using llvm-link in rust-accel/nvptx. Linking libcore causes the symbol name issue reported by @Jasper-Bekkers, due to the difference between GAS and PTX, and I avoid it by rewriting librustc_codegen_utils/symbol_names.rs.
---
@termoshtt The linker is supposed to fix this; it has a special "pass" that does renaming. The problem can happen not only with libcore: user code with method calls like `src_image.pixel(i, j)` can hit it too.

I found the linker robust enough at solving the issue. But still, I'd probably prefer this to be fixed in rustc.
---
bheisler commented on Aug 21, 2018:
@denzp - I don't think it's caused by the target JSON, unfortunately. When I add the …
---
bheisler referenced this issue on Aug 28, 2018: Segfault in rustc while cross-compiling core #53099 (open)
---
#38824 was closed about a month ago. I think we can remove it from the list?
---
japaric commented on Jan 2, 2017 (edited):
The NVPTX backend has been available since: nightly-2017-01-XX

This is a collection of bugs and TODOs related to it.

Documentation

Bugs

- LLVM assertion when compiling `core` to PTX. #38824
- LLVM error when emitting PTX code with debuginfo. #38785
- NVPTX: No "undefined reference" error is raised when it should be. #38786
- NVPTX: non-inlined functions can't be used cross crate. #38787

Missing features

- `__shared__` modifier. Probably needs an RFC to land in the compiler, as we don't have anything similar to it (AFAIK).

Stabilization

- All the non-trivial kernels make use of intrinsics like `blockIdx.x`. These will have to be stabilized. Right now these intrinsics are implemented as `"platform-intrinsics"`, but that feature is unstable.
- Stabilize the `"ptx-kernel"` ABI. #38788

cc @rkruppe