
Fix codegen bug in "ptx-kernel" abi related to arg passing #94703

Merged
merged 2 commits on Apr 26, 2022

Conversation

kjetilkjeka
Contributor

@kjetilkjeka kjetilkjeka commented Mar 7, 2022

I found a codegen bug in the nvptx ABI: args are passed as pointers (see comment), which is not what the ptx-interoperability doc specifies, nor how C/C++ does it. It will also almost always fail in practice, since the device and host use different memory spaces on most hardware.

This PR fixes the bug and adds tests for passing structs to ptx kernels.

I observed that all nvptx assembly tests had been marked as ignore a long time ago. I'm not sure if the new one should be marked as ignore too; it passed on my computer, but it might fail if ptx-linker is missing on the server. I guess this is outside the scope of this PR and should be looked at in a separate issue/PR.

I only fixed the nvptx64-nvidia-cuda target and not the potential code paths for the non-existent 32-bit target. Even though 32-bit nvptx is not a supported target, there is still some code under the hood supporting codegen for 32-bit PTX. I was advised to create an MCP to find out if this code should be removed or updated.

Perhaps @RDambrosio016 would have interest in taking a quick look at this.

@rustbot rustbot added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Mar 7, 2022
@rust-highfive
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @davidtwco (or someone else) soon.

Please see the contribution instructions for more information.

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Mar 7, 2022
@kjetilkjeka kjetilkjeka changed the title Nvptx kernel args abi2 Fix codegen bug in "ptx-kernel" abi related to arg passing Mar 7, 2022
@bjorn3 bjorn3 added the O-NVPTX Target: the NVPTX LLVM backend for running rust on GPUs, https://llvm.org/docs/NVPTXUsage.html label Mar 7, 2022
@RDambrosio016
Contributor

I may be missing it in the PTX docs, but does PTX actually want aggregates less than 8 bytes to be extended to 8 bytes?

@kjetilkjeka
Contributor Author

I may be missing it in the PTX docs, but does PTX actually want aggregates less than 8 bytes to be extended to 8 bytes?

You are absolutely right. I thought the issue with kernel codegen was limited to the kind of structs I had a problem with, and kept some of the old code. I think I will need to expand the test cases a lot to cover more of Rust's types and the rules in PTX Interoperability. I will change this PR to a draft now and continue to work on it tomorrow.

I also have to ask about the interest in getting this properly fixed in the compiler. I'm eager to do the work and would love to contribute to the Rust compiler, but it would be nice to know that if I get it right it will most likely be merged as well. It's also worth mentioning that I have not yet contributed to the Rust compiler and will realistically need a bit more explanation than a seasoned contributor. Perhaps @davidtwco is the right person to give an indication of this interest.

@kjetilkjeka kjetilkjeka marked this pull request as draft March 7, 2022 21:38
@RDambrosio016
Contributor

ptx-kernel has pretty much been a dead ABI until now because nvptx64 was broken and unusable. Fixing ptx-kernel is one of the steps to upstreaming the rust-cuda work back into rustc, so this needs to be done either way. I don't currently use ptx-kernel because, as mentioned, it's broken, but in the future I'd like to transition to it. Specifically, a CUDA codegen cannot register a special kernel-marking proc macro (well, it could, but it's ugly and probably wouldn't be accepted), therefore we need a special ABI to say that a function is a kernel. However, the ptx-kernel ABI needs a lot more work, including:

  • Fixing how "regular" params are passed, i.e. passing scalars correctly and passing aggregates correctly.
  • Deciding how to pass more exotic Rust types such as arrays, tuples, slices, u128, etc. I'm of the opinion that:
    • arrays are passed as byte arrays, i.e. by value;
    • tuples are passed by value with the elements side by side;
    • u128 is passed as a byte array;
    • slices are passed by value as a struct of ptr and len. However, I do not currently do this in rust-cuda; I pass them as 2 params, ptr and len. I initially did this for potential FFI compatibility, but I'm starting to think it causes more problems than it solves and they should probably just be passed by value, so that a cust DeviceSlice can be passed directly instead of needing to pass ptr and len.
  • (optional) Adding something to the typechecker to forbid a function calling a ptx-kernel function. However, this needs some discussion, because it might be beneficial to do a "mock" function call for dynamic parallelism to typecheck dynamic kernel calls.
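The "struct of ptr and len" option can be sketched on the host side. This is a hypothetical illustration (`SliceParam` and its method are invented names, not rust-cuda's actual API) of the representation a DeviceSlice-style type could lower to when passed by value as a single kernel param:

```rust
// Hypothetical sketch: a #[repr(C)] pointer+length pair that a
// DeviceSlice-style type could lower to as one by-value kernel param.
#[repr(C)]
pub struct SliceParam<T> {
    ptr: *const T,
    len: usize,
}

impl<T> SliceParam<T> {
    pub fn from_slice(s: &[T]) -> Self {
        SliceParam { ptr: s.as_ptr(), len: s.len() }
    }
}

fn main() {
    let data = [1u32, 2, 3];
    let p = SliceParam::from_slice(&data);
    // On a 64-bit host this struct occupies 16 bytes: an 8-byte pointer
    // plus an 8-byte length, i.e. one 128-bit-sized param.
    assert_eq!(std::mem::size_of::<SliceParam<u32>>(), 16);
    assert_eq!(p.len, 3);
}
```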

@kjetilkjeka
Contributor Author

I'm glad to hear that it is useful for rust-cuda. I would love to help out on things related to improving ptx generation through llvm and preparing for upstreaming the nvcc backend. I'm quite new to both ptx and rust compiler contributions so all guidance is highly appreciated.

I don't disagree with any of your points on how values should be passed. Especially making it possible to pass arrays as values without wrapping them in structs is a huge improvement compared to C/C++. However, I don't see the logic of treating tuples as multiple values versus as a struct, though perhaps I would with a little explanation. I also assume this can easily be changed right now for this unstable calling convention; since it's practically unusable in its current state, I assume nobody is relying on it.

On the other side, would it require a lot more time to stabilize the calling convention if all of these decisions had to be made immediately? I would believe stabilizing the calling convention with a stable ABI only for the following types (integers up to 64 bits, floating-point numbers, pointers, and structs/arrays of these) would be quite simple. Then tuples, 128-bit integers, and slices could come in additional RFCs. This does still mean they should be implemented properly from the start.


Perhaps unrelated to the "ptx-kernel" abi, but related to device code: the PTX interoperability document says little about return values. I assume it is beneficial to be able to use some specified ABI to call into .ptx device functions written in C/C++. That would require a standard ABI for both arguments and returns. Is the idea here to simply follow the C ABI?

@RDambrosio016
Contributor

I meant that tuples should be passed as structs with their fields ordered logically one after the other.

Also, the return convention is not a problem: ptx-kernel is for kernels, which cannot return values. For calling into foreign functions, it's just the same as normal FFI; it can use the C callconv, which CUDA C++ can generate just fine.

@kjetilkjeka
Contributor Author

kjetilkjeka commented Mar 8, 2022

After yesterday's discovery that kernel arg passing is more broken than initially expected, I did some research.

First of all, 128-bit integers are supported in nvcc starting from 11.5. I did a quick test with nvcc, and they are packed as 16-byte-aligned byte arrays. That makes it a lot simpler for us, as doing the thing you proposed is essentially the same as what nvcc does anyway.
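The byte-array view of a u128 can be pictured on the host side with a plain Rust round trip (just an illustration of "shipping 16 bytes", not the actual kernel codegen):

```rust
fn main() {
    let x: u128 = 0x0102_0304_0506_0708_090a_0b0c_0d0e_0f10;
    // "Passed as a byte array" boils down to passing these 16 bytes;
    // round-tripping shows the representation loses nothing.
    let bytes: [u8; 16] = x.to_le_bytes();
    assert_eq!(bytes.len(), 16);
    assert_eq!(u128::from_le_bytes(bytes), x);
}
```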

What got me completely confused was checking how nvcc compiles arguments of the integer types compared to what the PTX interoperability document states. In section 3.3 it states that integral types of less than 32 bits should be zero- or sign-extended to 32 bits, depending on the type, and that both .u32 and .s32 should be used depending on the type. What nvcc does in practice is use the exact size, not differentiate between u/s, and use u for everything.
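For reference, the zero/sign extension that section 3.3 describes corresponds to the following host-side widening (a plain Rust illustration, not PTX):

```rust
fn main() {
    let raw: u8 = 0xff;
    // Zero extension: what the doc prescribes for unsigned sub-32-bit types.
    let as_u32 = raw as u32; // 0x000000ff
    // Sign extension: what it prescribes for signed types (reinterpret as
    // i8 first, then widen, so the high bits replicate the sign bit).
    let as_i32 = raw as i8 as i32; // 0xffffffff, i.e. -1
    assert_eq!(as_u32, 0xff);
    assert_eq!(as_i32, -1);
}
```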

Take a function using uint8_t as an argument: the PTX interoperability document states that the param should be .param .u32 foo_param_0, while what nvcc actually emits is .param .u8 foo_param_0. Do you know if the PTX interoperability document is only meant for device (non-kernel) functions, or if Nvidia is not following their own documentation?

@RDambrosio016
Contributor

This is why I think the PTX interop guide is either outdated, wrong, or misleading. Extending any arguments would mean that the host code would also need to do so when calling the kernel, which is easily unsound and confusing. I think we should just go with what nvcc does in terms of sizes of everything. Signedness does not really matter, as it does not affect the kernel caller. Size does, however, since using incorrect param sizes when calling the kernel is UB (thanks for not doing the smallest of safety checks, CUDA).

@kjetilkjeka
Contributor Author

A small status update.

Since last time, I have thrown the PTX interoperability document in the trash and am looking at what nvcc and clang are actually doing.

I have confirmed that no 64-bit width extension is required and removed it from the code for both kernels and device functions.

I have added more tests to confirm that the correct PTX code is generated. I'm not sure yet if it will be possible to have these tests enabled in the compiler by default; I must investigate further why the other nvptx tests are ignored.

Contrary to NVCC and Clang, there are no checks in Rust that a kernel does not have a return value. I'm investigating how best to add this to Rust. Perhaps this is outside the scope of this PR, but I would like to know a bit more about what it boils down to before deciding to leave it out. If anyone with rustc experience could point me in the right direction, that would be lovely.

@RDambrosio016 can you please explain a bit more what problems there will be by:

  1. Passing slices as two separate u64 values instead of as a byte array?
  2. Expanding each tuple member into a separate argument instead of passing the tuple as a byte array?

@RDambrosio016
Contributor

Passing slices as two separate u64 instead of as a byte array?

Nothing, I just thought passing it separately would be better in case someone wants to call the Rust kernels from CUDA C++. But I'm starting to think it's not worth it and we should just pass it as one 128-bit param.

Expand each tuple member to an argument instead of passing as byte array?

I think we are misunderstanding each other; I meant precisely that tuples should be passed as byte arrays. Any other way of passing them would just be confusing.

@bjorn3
Member

bjorn3 commented Mar 12, 2022

Slices don't have a stable ABI for any calling convention currently. For the current Rust ABI, any type whose layout says Abi::ScalarPair needs to be passed as if both fields were separate arguments (necessary for e.g. vtable calls). For any other ABI it doesn't matter.

@RDambrosio016
Contributor

RDambrosio016 commented Mar 12, 2022

Yeah, though a stable way of passing slices to GPU kernels is kind of a prerequisite for stable CUDA support in Rust, since slices are at the core of passing immutable data to kernels. Especially since they result in noalias, which is great for performance on GPUs.

@RDambrosio016
Contributor

Slices don't have a stable ABI for any calling convention currently. For the current Rust ABI, any type whose layout says Abi::ScalarPair needs to be passed as if both fields were separate arguments (necessary for e.g. vtable calls). For any other ABI it doesn't matter.

Yeah, this does not affect rustcall at all; this is what we do in rust-cuda currently. The Rust ABI is left untouched for normal function calls in a kernel.

@bjorn3
Member

bjorn3 commented Mar 12, 2022

I would like to see slices stabilized as passing pointer and length as two separate arguments on all platforms, like the current implementation. I believe there is an open issue for this on the unsafe-code-guidelines repo. However, I don't think nvptx should do this before a guarantee is made on all platforms.

@RDambrosio016
Contributor

I would like to see slices stabilized as passing pointer and length as two separate arguments on all platforms, like the current implementation. I believe there is an open issue for this on the unsafe-code-guidelines repo. However, I don't think nvptx should do this before a guarantee is made on all platforms.

This is fine for callconvs that are generally aimed at interacting with other languages; however, ptx-kernel is mostly just aimed at calling Rust kernels through Rust wrappers to CUDA. Also, CUDA natively supports passing args of any size, unlike other calling conventions that don't and instead require more exotic things, so CUDA is a bit special in this regard.

@bjorn3
Member

bjorn3 commented Mar 12, 2022

The thing is that currently the layout of slices is not guaranteed in any way. We are even allowed to add fields beyond a pointer and size (e.g. a value derived by encrypting the pointer and size for exploit mitigation), or to have, for example, the size measured in bytes rather than elements. I think we need to stabilize that we don't do these things before we stabilize how they are passed in any ABI.
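For comparison, the only layout-independent way to take a slice apart today is through its stable accessor methods, e.g.:

```rust
fn main() {
    let v = vec![10u8, 20, 30];
    let s: &[u8] = &v;
    // as_ptr()/len() are the stable accessors; assuming the in-memory
    // layout of &[T] itself is exactly (ptr, len) is not guaranteed.
    let (ptr, len) = (s.as_ptr(), s.len());
    let rebuilt: &[u8] = unsafe { core::slice::from_raw_parts(ptr, len) };
    assert_eq!(rebuilt, s);
}
```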

@RDambrosio016
Contributor

Yeah, that's true. It's kind of surprising that something so central to Rust is still super unstable when it comes to layout/FFI.

@bjorn3
Member

bjorn3 commented Mar 12, 2022

Also, I think slices should be passed as two i64 values rather than a single i128. Doing anything else requires extra code in the callconv calculation and slightly bloats the IR, as rustc implements PassMode::Cast by storing to the stack and then loading again, while PassMode::Pair is implemented as extractelement/insertelement.

@RDambrosio016
Contributor

Yeah, there are pros and cons to both; I'd be fine with either one though.

@kjetilkjeka
Contributor Author

kjetilkjeka commented Mar 12, 2022

Thanks for clearing up the confusion around slice passing. For the scope of this PR, it's more about doing the most sensible thing than about assuring stable behavior with regard to slices. The improper_ctypes_definitions lint will trigger when attempting to use a slice in a kernel, which is consistent with the C ABI, and all other targets currently pass slices as two separate arguments. If I understand the conclusion correctly, we should just pass them as two u64 values, but without any stability guarantee or encouragement to use slices for kernel args.

Yeah, though a stable way of passing slices to gpu kernels is kind of a prereq to stable CUDA support in rust

It's always possible to pass ptr + len manually until slices are supported. Wouldn't it be UB to pass a &mut [T] to a kernel anyway, since there would be one alias to the memory for each thread?

I think we are misunderstanding eachother, i meant that tuples should be passed as byte arrays precisely. Any other way of passing them would just be confusing.

What I'm talking about is that Rust seems to handle tuples differently depending on the size of the tuple. For two elements it will pass them, much like the slice, as two separate arguments. For three elements, Rust will pass the tuple as a struct. See the tests below for a more concise explanation. I also think expanding the tuples into separate arguments, like Rust does for tuples of size 2, would be quite reasonable behavior.

Since tuples also have an unstable ABI, perhaps it's best to do what @bjorn3 suggests for slices and avoid doing anything special, so as to avoid special handling in the callconv code?

Test below:

// CHECK: .visible .entry f_u32_u16_tuple_arg(
// CHECK: .param .u32 f_u32_u16_tuple_arg_param_0
// CHECK: .param .u16 f_u32_u16_tuple_arg_param_1
#[no_mangle]
pub unsafe extern "ptx-kernel" fn f_u32_u16_tuple_arg(a: (u32, u16)) {}

// CHECK: .visible .entry f_u8_u8_u32_tuple_arg(
// CHECK: .param .align 4 .b8 f_u8_u8_u32_tuple_arg_param_0[8]
#[no_mangle]
pub unsafe extern "ptx-kernel" fn f_u8_u8_u32_tuple_arg(a: (u8, u8, u32)) {}
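For what it's worth, the sizes in those CHECK lines match what the host-side layout reports. Tuple layout is formally unstable, but current rustc on a 64-bit host gives these values:

```rust
use std::mem::{align_of, size_of};

fn main() {
    // (u8, u8, u32): 6 bytes of fields padded to 8, with 4-byte
    // alignment, matching the `.param .align 4 .b8 ...[8]` byte array.
    assert_eq!(size_of::<(u8, u8, u32)>(), 8);
    assert_eq!(align_of::<(u8, u8, u32)>(), 4);
    // (u32, u16): also 8 bytes, but passed as two separate params above.
    assert_eq!(size_of::<(u32, u16)>(), 8);
}
```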

@kjetilkjeka
Contributor Author

kjetilkjeka commented Mar 16, 2022

Also I think slices should be passed as two i64 rather than a single i128. Doing anything else requires extra code in the call conv calculation and slightly bloats ir as rustc implements PassMode::Cast by storing to the stack and then loading again, while PassMode::Pair is implemented as extractelement/insertelement.

@bjorn3 It seems like a struct with two elements (like the one below) will also be implemented as PassMode::Pair, even if it is marked as #[repr(C)]. I don't see any way to check whether the type is actually a slice or an actual struct in compiler::rustc_target::abi::call::ArgAbi<'a, Ty>, as it is generic over the type without being bound by any traits.

I think the reason this does not matter for C is that it's just stack memory anyway in an ordinary call. I looked at wasm for inspiration but didn't get much wiser. For PTX kernels it is important that the number of arguments matches up between caller and callee.

Are there any ways to keep slices as PassMode::Pair while casting structs that are also PassMode::Pair? Should this be special-cased for kernels somewhere other than rustc_target? Or is it simpler to just treat everything the same and also cast slices?

Here is an example of a struct that is turned into PassMode::Pair:

#[repr(C)]
pub struct DoubleFloat {
    f: f32,
    g: f32,
}

@bjorn3
Member

bjorn3 commented Mar 16, 2022

I don't see any way to check if the type is actually a slice or an actual struct in compiler::rustc_target::abi::call::ArgAbi<'a, Ty> as it is generic over the type without being bound on any traits.

You could add a method to TyAbiInterface to check if something is a #[repr(C)] struct or something like that. The actual abi calculation impls have a Ty: TyAbiInterface bound.

@bors
Contributor

bors commented Apr 21, 2022

⌛ Testing commit cea692a01169bdcf5a109b856204c0c5f4e19c57 with merge 8783abad5edb3f8d2c6097066708ab49753d6d44...

@rust-log-analyzer

This comment has been minimized.

@bors
Contributor

bors commented Apr 21, 2022

💔 Test failed - checks-actions

@bors bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Apr 21, 2022
@kjetilkjeka
Contributor Author

It seems the CI fails due to not having ptx-linker. Should I ignore the test until the LLVM bug can be resolved and we can use emit-asm?

@nagisa
Member

nagisa commented Apr 24, 2022

Hm, curious. Other ptx-linker tests do indeed just ignore nvptx64. In that case I'd say ignore this test as well, and work on making the test infrastructure for nvptx work can happen as a follow-up, whether that is done by making ptx-linker available somehow or by fixing LLVM asm emission.

@kjetilkjeka
Contributor Author

@nagisa // ignore-nvptx64 added for now. My next step is to try to figure out how we can enable nvptx tests again.

@nagisa
Member

nagisa commented Apr 26, 2022

@bors r+

@bors
Contributor

bors commented Apr 26, 2022

📌 Commit 5bf5acc has been approved by nagisa

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Apr 26, 2022
GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this pull request Apr 26, 2022
… r=nagisa

Fix codegen bug in "ptx-kernel" abi related to arg passing

This was referenced Apr 26, 2022
bors added a commit to rust-lang-ci/rust that referenced this pull request Apr 26, 2022
…laumeGomez

Rollup of 8 pull requests

Successful merges:

 - rust-lang#94022 (Clarify that `Cow::into_owned` returns owned data)
 - rust-lang#94703 (Fix codegen bug in "ptx-kernel" abi related to arg passing)
 - rust-lang#95949 (Implement Default for AssertUnwindSafe)
 - rust-lang#96361 (Switch JS code to ES6)
 - rust-lang#96372 (Suggest calling method on nested field when struct is missing method)
 - rust-lang#96386 (simplify `describe_field` func in borrowck's diagnostics part)
 - rust-lang#96400 (Correct documentation for `Rvalue::ShallowInitBox`)
 - rust-lang#96415 (Remove references to git.io)

Failed merges:

r? `@ghost`
`@rustbot` modify labels: rollup
@bors bors merged commit fe49981 into rust-lang:master Apr 26, 2022
@bors
Contributor

bors commented Apr 26, 2022

⌛ Testing commit 5bf5acc with merge 082e4ca...

@rustbot rustbot added this to the 1.62.0 milestone Apr 26, 2022
juntyr added a commit to juntyr/rust-cuda that referenced this pull request May 19, 2024
Experiments with Rust Futures

Implemented derive for RustToCudaAsync

Implemented async kernel launch

Fixed RustToCudaAsync derive

LaunchPackage with non-mut Stream

Moved stream to be an explicit kernel argument

Updated ExchangeWrapperOn[Device|Host]Async::move_to_stream

Upgraded to fixed RustaCuda

Added scratch-space methods for uni-directional CudaExchangeItem

Added unsafe-aliasing API to SplitSlideOverCudaThreads[Const|Dynamic]Stride

Extended the CudaExchangeItem API with scratch and uMaybeUninit

Rename SplitSliceOverCudaThreads[Const|Dynamic]Strude::alias_[mut_]unchecked

Implemented #[cuda(crate)] and #[kernel(crate)] attributes

Added simple thread-block shared memory support

Fixed device utils doc tests

Convert cuda thread-block-shared memory address to generic

First steps towards better shared memory, including dynamic

Revert derive changes + R2C-based approach start

Some progress on shared slices

Backup of progress on compile-time PTX checking

Clean up the PTX JIT implementation

Add convenience functions for ThreadBlockShared arrays

Improve and fix CI

Remove broken ThreadBlockShared RustToCuda impl

Refactor kernel trait generation to push more safety constraints to the kernel definition

Fixed SomeCudaAlloc import

Added error handling to the compile-time PTX checking

Add PTX lint parsing, no actual support yet

Added lint checking support to monomorphised kernel impls

Improve kernel checking + added cubin dump lint

Fix kernel macro config parsing

Explicitly fitting Device[Const|Mut]Ref into device registers

Switched one std:: to core::

Remove register-sized CUDA kernel args check, unnecessary since rust-lang/rust#94703

Simplified the kernel parameter layout extraction from PTX

Fix up rebase issues

Install CUDA in all CI steps

Use CStr literals

Simplify and document the safety traits

Fix move_to_cuda bound

Fix clippy for 1.76

Cleaned up the rust-cuda device macros with better print

The implementation still uses String for dynamic formatting, which
currently pulls in loads of formatting and panic machinery.

While a custom String type that pre-allocated the exact format String
length can avoid some of that, the formatting machinery even for e.g.
usize is still large.

If `format_args!` is ever optimised for better inlining, the more
verbose and lower-level implementation could be reconsidered.

Switch to using more vprintf in embedded CUDA kernel

Make print example fully executable

Clean up the print example

ptr_from_ref is stable from 1.76

Exit on CUDA panic instead of abort to allow the host to handle the error

Backup of early progress for switching from kernel traits to functions

More work into kernel functions instead of traits

Eliminate almost all ArgsTrait usages

Some refactoring of the async kernel func type + wrap code

Early sketch of extracting type wrapping from macro into types and traits

Early work towards using trait for kernel type wrap, ptx jit workaround missing

Lift complete CPU kernel wrapper from proc macro into public functions

Add async launch helper

Further cleanup of the new kernel param API

Start cleaning up the public API

Allow passing ThreadBlockShared to kernels again

Remove unsound mutable lending to CUDA for now

Allow passing ThreadBlockSharedSlice to kernel for dynamic shared memory

Begin refactoring the public API with device feature

Refactoring to prepare for better module structure

Extract kernel module just for parameters

Add RustToCuda impls for &T, &mut T, &[T], and &mut [T] where T: RustToCuda

Large restructuring of the module layout for rust-cuda

Split rust-cuda-kernel off from rust-cuda-derive

Update codecov action to handle rust-cuda-kernel

Fix clippy lint

Far too much time spent getting rid of DeviceCopy

More refactoring and auditing kernel param bounds

First exploration towards a stricter async CUDA API

More experiments with async API

Further API experimentation

Further async API experimentation

Further async API design work

Add RustToCudaAsync impls for &T and &[T], but not &mut T or &mut [T]

Add back mostly unchanged exchange wrapper + buffer with RustToCudaAsync impls

Add back mostly unchanged anti-aliasing types with RustToCudaAsync impls

Progress on replacing ...Async with Async<...>

Seal more implementation details

Further small API improvements

Add AsyncProj helper API struct for async projections

Disable async derive in examples for now

Implement RustToCudaAsync derive impls

Further async API improvements to add drop behaviour

First sketch of the safety constraints of a new NoSafeAliasing trait

First steps towards reintroducing LendToCudaMut

Fix no-std Box import for LendRustToCuda derive

Re-add RustToCuda implementation for Final

Remove redundant RustToCudaAsyncProxy

More progress on less 'static bounds on kernel params

Further investigation of less 'static bounds

Remove 'static bounds from LendToCuda ref kernel params

Make CudaExchangeBuffer Sync

Make CudaExchangeBuffer Sync v2

Add AsyncProj proj_ref and proj_mut convenience methods

Add RustToCudaWithPortableBitCloneSemantics adapter

Fix invalid const fn bounds

Add Deref[Mut] to the adapters

Fix pointer type inference error

Try removing __rust_cuda_ffi_safe_assert module

Ensure async launch mutable borrow safety with barriers on use and stream move

Fix uniqueness guarantee for Stream using branded types

Try without ref proj

Try add extract ref

Fix doc link

clean up kernel signature check

Some cleanup before merging

Fix some clippy lints, add FIXMEs for others

Add docs for rust-cuda-derive

Small refactoring + added docs for rust-cuda-kernel

Bump MSRV to 1.77-nightly

Try trait-based kernel signature check

Try naming host kernel layout const

Try match against byte literal for faster comparison

Try with memcmp intrinsic

Try out experimental const-type-layout with compression

Try check

Try check again
juntyr added a commit to juntyr/rust-cuda that referenced this pull request May 20, 2024
* Initial work on supporting some async memory transfers

* Fix CUDA install in CI

* Switch from kernel type signature check to random hash

* Fix CI-identified failures

* Use pinned nightly in CI

* Try splitting the kernel func signature type check

* Try with llvm-bitcode-linker

* Upgrade to latest ptx-builder

* Fix codecov by excluding ptx tests (codecov weirdly overrides linker)