Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Any parallel sorting on the GPU with Taichi or thirdparty ? #3764

Open
GrapixLeGrand opened this issue Dec 9, 2021 · 12 comments
Open

Any parallel sorting on the GPU with Taichi or thirdparty ? #3764

GrapixLeGrand opened this issue Dec 9, 2021 · 12 comments
Labels
question Question on using Taichi

Comments

@GrapixLeGrand
Copy link

Hello,

I could not find a thread (nor in the docs) that would cover this subject, therefore I created one. Is there any parallel sorting algorithm in taichi? I'm looking for a radix/counting sort. Is there a known taichi implementation of a radix/counting sort out there? I was considering writing my own following this paper. However, according to the doc (under the atomics section), the atomic assign operation is missing and it would be needed for step 4 of the algorithm b[c[r[a[i]]]++] = a[i];.

Best,
Quentin

@GrapixLeGrand GrapixLeGrand added the question Question on using Taichi label Dec 9, 2021
@qiao-bo
Copy link
Collaborator

qiao-bo commented Dec 9, 2021

Hi Quentin,
I am not aware of an existing GPU-accelerated sorting algo in Taichi, perhaps other people can comment on this.

Regarding to the atomic assign operation, would this work (doc)?

b[ti.atomic_add(c[b[a[i]]], 1)] = a[i]

@GrapixLeGrand
Copy link
Author

Thanks for the reply. I'll give it a try but in fact the paper that I cited proposed a solution without the atomic, do you know how to get the maximum amount of thread? It would replace b[ti.atomic_add(c[b[a[i]]], 1)] = a[i] by:
Capture_ok

If this is not possible then I'll try the atomic. What you proposed seems very reasonable, oddly the paper says that you need two atomics operations to perform this. This is wrong isn't it? Am I mistaken?
Best,
Quentin

@GrapixLeGrand
Copy link
Author

Hello back,

It worked for the atomics, thanks! Now, do you know how to get the max number of global threads?
Best,
Quentin

@qiao-bo
Copy link
Collaborator

qiao-bo commented Dec 10, 2021

Hi,

I need to take a deeper look into the paper, but I think Taichi hasn't exposed the max num of global threads. A Taichi kernel may be compiled to e.g., multiple CUDA kernels and they could have different number of threads. But this is roughly how we calculated (code link):

grid_dim = min(#SMs * 32, num_threads_needed) // see above code link
block_dim = 128 by default or specified by ti.block_dim() otherwise

This may not help much, we do have a global_thread_idx() available.

@qiao-bo
Copy link
Collaborator

qiao-bo commented Dec 10, 2021

We do think a sort intrinsic can be useful for many Taichi users. We are considering add some parallel intrinsics to the language. Let us know what else you find useful ;)

@turbo0628
Copy link
Member

We do think a sort intrinsic can be useful for many Taichi users. We are considering add some parallel intrinsics to the language. Let us know what else you find useful ;)

Is that possible to add some hooks to for external libraries? Many such math functions / permutation utilities are already implemented in vendor libraries, CUB for example.

@qiao-bo
Copy link
Collaborator

qiao-bo commented Dec 10, 2021

We do think a sort intrinsic can be useful for many Taichi users. We are considering add some parallel intrinsics to the language. Let us know what else you find useful ;)

Is that possible to add some hooks to for external libraries? Many such math functions / permutation utilities are already implemented in vendor libraries, CUB for example.

Very good point! CUB exposes very efficient permutation utilities, and it's hard to be better ;). Although in Taichi we want to avoid Vendor lock-in, having some external libs could still be a good start...

@turbo0628
Copy link
Member

turbo0628 commented Dec 10, 2021

Very good point! CUB exposes very efficient permutation utilities, and it's hard to be better ;). Although in Taichi we want to avoid Vendor lock-in, having some external libs could still be a good start...

Integration of any platform-specific library would definitely make Taichi less consistent, especially when switching among hardware backends. I totally agree that vendor lock-in should be avoided for Taichi, especially in the current stage. The alternative approach is to mimic vendor library with some consistent primitives. However, it has already been proven to be a hard task in TVM. The TVM community recently adopts the BYOC mechanism for different vendor libraries. They are now actively working to bring in CUTLASS, oneDNN and ArmComputeLibrary.

If we widen the concept of vendor libraries, they are actually equivalent with user-written codes. The second problem raises here: how to reasonably hybridize compiler generated code with user-written / vendor code? Those codes are probably to be efficient but they are just black boxes for the compiler. I don't think there's already an efficient and elegant solution, hope to see that happen in Taichi :)

@bcolloran
Copy link

@GrapixLeGrand following your work here with interest! If you get a working implementation, please post details back here if you are able to share :-)

@k-ye
Copy link
Member

k-ye commented Dec 13, 2021

the BYOC mechanism for different vendor libraries. They are now actively working to bring in CUTLASS, oneDNN and ArmComputeLibrary.

This sounds like an interesting direction!

FYI we have some basic experiments regarding how to integrate with vendor/external libs. One example is the Sparse Matrix feature on top of Eigen (#2792). Another one being @squarefk 's external LLVM bitcode call support (#2873). Clearly, these are prototypes, rather than a systematic solution.

Realistically speaking, I don't think we should strive for a completely vendor-free implementation here, or the dev/maintainence cost is gonna be too high :-( I believe either the BYOC or provide a flexible extension system to wrap these vendor libs are worth a shot..

@AmesingFlank
Copy link
Collaborator

Implemented a version of odd-even parallel sort at #3790.

@rexwangcc rexwangcc added this to To do in Lang Features & Python via automation Feb 14, 2022
@rexwangcc rexwangcc moved this from To do to Done in Lang Features & Python Feb 14, 2022
@rexwangcc rexwangcc moved this from Done to To do in Lang Features & Python Feb 14, 2022
@FantasyVR
Copy link
Collaborator

FantasyVR commented Mar 23, 2022

Hi @GrapixLeGrand. We decide to make more effort on GPU sorting recently. Though we know users prefer to use counting sort/radix sort, we want to know if our odd-even sorting meets your need? If you can share your sorting implementation here, we are willing to help and provide some support in Taichi.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Question on using Taichi
Projects
Development

No branches or pull requests

7 participants