Any parallel sorting on the GPU with Taichi or thirdparty ? #3764

GrapixLeGrand · 2021-12-09T14:06:37Z

Hello,

I could not find a thread (nor in the docs) that would cover this subject, therefore I created one. Is there any parallel sorting algorithm in taichi? I'm looking for a radix/counting sort. Is there a known taichi implementation of a radix/counting sort out there? I was considering writing my own following this paper. However, according to the doc (under the atomics section), the atomic assign operation is missing and it would be needed for step 4 of the algorithm b[c[r[a[i]]]++] = a[i];.

Best,
Quentin

The text was updated successfully, but these errors were encountered:

qiao-bo · 2021-12-09T15:30:04Z

Hi Quentin,
I am not aware of an existing GPU-accelerated sorting algo in Taichi, perhaps other people can comment on this.

Regarding to the atomic assign operation, would this work (doc)?

b[ti.atomic_add(c[b[a[i]]], 1)] = a[i]

GrapixLeGrand · 2021-12-09T15:51:38Z

Thanks for the reply. I'll give it a try but in fact the paper that I cited proposed a solution without the atomic, do you know how to get the maximum amount of thread? It would replace b[ti.atomic_add(c[b[a[i]]], 1)] = a[i] by:

If this is not possible then I'll try the atomic. What you proposed seems very reasonable, oddly the paper says that you need two atomics operations to perform this. This is wrong isn't it? Am I mistaken?
Best,
Quentin

GrapixLeGrand · 2021-12-09T16:22:39Z

Hello back,

It worked for the atomics, thanks! Now, do you know how to get the max number of global threads?
Best,
Quentin

qiao-bo · 2021-12-10T02:17:24Z

Hi,

I need to take a deeper look into the paper, but I think Taichi hasn't exposed the max num of global threads. A Taichi kernel may be compiled to e.g., multiple CUDA kernels and they could have different number of threads. But this is roughly how we calculated (code link):

grid_dim = min(#SMs * 32, num_threads_needed) // see above code link
block_dim = 128 by default or specified by ti.block_dim() otherwise

This may not help much, we do have a global_thread_idx() available.

qiao-bo · 2021-12-10T02:20:21Z

We do think a sort intrinsic can be useful for many Taichi users. We are considering add some parallel intrinsics to the language. Let us know what else you find useful ;)

turbo0628 · 2021-12-10T15:48:14Z

We do think a sort intrinsic can be useful for many Taichi users. We are considering add some parallel intrinsics to the language. Let us know what else you find useful ;)

Is that possible to add some hooks to for external libraries? Many such math functions / permutation utilities are already implemented in vendor libraries, CUB for example.

qiao-bo · 2021-12-10T16:11:10Z

We do think a sort intrinsic can be useful for many Taichi users. We are considering add some parallel intrinsics to the language. Let us know what else you find useful ;)

Is that possible to add some hooks to for external libraries? Many such math functions / permutation utilities are already implemented in vendor libraries, CUB for example.

Very good point! CUB exposes very efficient permutation utilities, and it's hard to be better ;). Although in Taichi we want to avoid Vendor lock-in, having some external libs could still be a good start...

turbo0628 · 2021-12-10T16:36:47Z

Very good point! CUB exposes very efficient permutation utilities, and it's hard to be better ;). Although in Taichi we want to avoid Vendor lock-in, having some external libs could still be a good start...

Integration of any platform-specific library would definitely make Taichi less consistent, especially when switching among hardware backends. I totally agree that vendor lock-in should be avoided for Taichi, especially in the current stage. The alternative approach is to mimic vendor library with some consistent primitives. However, it has already been proven to be a hard task in TVM. The TVM community recently adopts the BYOC mechanism for different vendor libraries. They are now actively working to bring in CUTLASS, oneDNN and ArmComputeLibrary.

If we widen the concept of vendor libraries, they are actually equivalent with user-written codes. The second problem raises here: how to reasonably hybridize compiler generated code with user-written / vendor code? Those codes are probably to be efficient but they are just black boxes for the compiler. I don't think there's already an efficient and elegant solution, hope to see that happen in Taichi :)

bcolloran · 2021-12-13T01:42:18Z

@GrapixLeGrand following your work here with interest! If you get a working implementation, please post details back here if you are able to share :-)

k-ye · 2021-12-13T05:32:44Z

the BYOC mechanism for different vendor libraries. They are now actively working to bring in CUTLASS, oneDNN and ArmComputeLibrary.

This sounds like an interesting direction!

FYI we have some basic experiments regarding how to integrate with vendor/external libs. One example is the Sparse Matrix feature on top of Eigen (#2792). Another one being @squarefk 's external LLVM bitcode call support (#2873). Clearly, these are prototypes, rather than a systematic solution.

Realistically speaking, I don't think we should strive for a completely vendor-free implementation here, or the dev/maintainence cost is gonna be too high :-( I believe either the BYOC or provide a flexible extension system to wrap these vendor libs are worth a shot..

AmesingFlank · 2021-12-14T00:54:56Z

Implemented a version of odd-even parallel sort at #3790.

FantasyVR · 2022-03-23T05:35:53Z

Hi @GrapixLeGrand. We decide to make more effort on GPU sorting recently. Though we know users prefer to use counting sort/radix sort, we want to know if our odd-even sorting meets your need? If you can share your sorting implementation here, we are willing to help and provide some support in Taichi.

GrapixLeGrand added the question Question on using Taichi label Dec 9, 2021

AmesingFlank mentioned this issue Dec 14, 2021

[lang] Add parallel sort utility #3790

Merged

rexwangcc added this to To do in Lang Features & Python via automation Feb 14, 2022

rexwangcc moved this from To do to Done in Lang Features & Python Feb 14, 2022

rexwangcc moved this from Done to To do in Lang Features & Python Feb 14, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Any parallel sorting on the GPU with Taichi or thirdparty ? #3764

Any parallel sorting on the GPU with Taichi or thirdparty ? #3764

GrapixLeGrand commented Dec 9, 2021

qiao-bo commented Dec 9, 2021 •

edited

GrapixLeGrand commented Dec 9, 2021

GrapixLeGrand commented Dec 9, 2021

qiao-bo commented Dec 10, 2021 •

edited

qiao-bo commented Dec 10, 2021

turbo0628 commented Dec 10, 2021

qiao-bo commented Dec 10, 2021

turbo0628 commented Dec 10, 2021 •

edited

bcolloran commented Dec 13, 2021

k-ye commented Dec 13, 2021

AmesingFlank commented Dec 14, 2021

FantasyVR commented Mar 23, 2022 •

edited

Any parallel sorting on the GPU with Taichi or thirdparty ? #3764

Any parallel sorting on the GPU with Taichi or thirdparty ? #3764

Comments

GrapixLeGrand commented Dec 9, 2021

qiao-bo commented Dec 9, 2021 • edited

GrapixLeGrand commented Dec 9, 2021

GrapixLeGrand commented Dec 9, 2021

qiao-bo commented Dec 10, 2021 • edited

qiao-bo commented Dec 10, 2021

turbo0628 commented Dec 10, 2021

qiao-bo commented Dec 10, 2021

turbo0628 commented Dec 10, 2021 • edited

bcolloran commented Dec 13, 2021

k-ye commented Dec 13, 2021

AmesingFlank commented Dec 14, 2021

FantasyVR commented Mar 23, 2022 • edited

qiao-bo commented Dec 9, 2021 •

edited

qiao-bo commented Dec 10, 2021 •

edited

turbo0628 commented Dec 10, 2021 •

edited

FantasyVR commented Mar 23, 2022 •

edited