Seeking advice on implementing certain operators on GPU

I am considering implementing some of bottleneck's operators on the GPU, using libraries such as PyTorch and CuPy, and perhaps CUDA or [Triton](https://github.com/openai/triton).

For the "move" family of operators, PyTorch on the GPU gives a significant speedup on large inputs. (I implemented the sliding window with [`unfold`](https://pytorch.org/docs/stable/generated/torch.Tensor.unfold.html).)
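For context, an `unfold`-based moving mean can be sketched roughly as follows. This is a minimal illustration, not the exact code from my attempt: the name `move_mean_unfold` is hypothetical, it only handles the last axis, it assumes the input is at least `window` elements long, and it mimics bottleneck's default of emitting NaN for the first `window - 1` positions.

```python
import torch

def move_mean_unfold(x: torch.Tensor, window: int) -> torch.Tensor:
    """Moving mean over the last dimension (hypothetical helper)."""
    # unfold(dim, size, step) returns a view of shape (..., n - window + 1, window)
    windows = x.unfold(-1, window, 1)
    means = windows.mean(dim=-1)
    # Pad the front with NaN so the output has the same length as the input
    pad = x.new_full(tuple(x.shape[:-1]) + (window - 1,), float("nan"))
    return torch.cat([pad, means], dim=-1)

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.arange(6, dtype=torch.float64, device=device)
out = move_mean_unfold(x, 3)
```

Because `unfold` returns a view, the windows themselves are not copied; the reduction over the window dimension is where the work happens.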

However, I've run into difficulties implementing the `rankdata`, `nanrankdata`, and `push` operators. Performance is worse than expected (in fact, much slower than bottleneck's implementation), and I suspect the for-loops in my implementations are causing the slowdown.
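One loop-free direction for `rankdata` might be to sort once and derive tie-averaged ranks from cumulative scans. The sketch below is an assumption for illustration, not a tested replacement for bottleneck: `rankdata_last_dim` is a hypothetical name, it ranks along the last axis only, and it does not handle NaN (so it does not cover `nanrankdata`).

```python
import torch

def rankdata_last_dim(x: torch.Tensor) -> torch.Tensor:
    """Ranks starting at 1 along the last dim, ties averaged (hypothetical helper)."""
    sorted_x, order = torch.sort(x, dim=-1)
    n = x.shape[-1]
    # Ordinal ranks 1..n in sorted order
    pos = torch.arange(1, n + 1, device=x.device).expand_as(x)
    # A tie group starts where the sorted value differs from its predecessor
    is_start = torch.ones_like(sorted_x, dtype=torch.bool)
    is_start[..., 1:] = sorted_x[..., 1:] != sorted_x[..., :-1]
    # A tie group ends where the next element starts a new group
    is_end = torch.ones_like(is_start)
    is_end[..., :-1] = is_start[..., 1:]
    # Propagate each group's first ordinal rank forward with a running max
    first = torch.cummax(
        torch.where(is_start, pos, torch.zeros_like(pos)), dim=-1
    ).values
    # Propagate each group's last ordinal rank backward with a reversed running min
    sentinel = torch.full_like(pos, n + 1)
    last = torch.cummin(
        torch.where(is_end, pos, sentinel).flip(-1), dim=-1
    ).values.flip(-1)
    avg = (first + last).to(torch.float64) / 2
    # Scatter the averaged ranks back to the original element positions
    ranks = torch.empty_like(avg)
    ranks.scatter_(-1, order, avg)
    return ranks

device = "cuda" if torch.cuda.is_available() else "cpu"
data = torch.tensor([[10.0, 20.0, 10.0, 30.0], [3.0, 3.0, 3.0, 1.0]], device=device)
ranks = rankdata_last_dim(data)
```

Every step here is a sort, an elementwise op, or a scan, so there is no Python-level loop over elements or rows.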

Do you have any suggestions or recommendations on how to efficiently implement these operators on the GPU?
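To make the question concrete for `push`, one possible loop-free formulation is to track the index of the most recent non-NaN value with a running maximum and then gather from it. This is a sketch under assumptions: `push_ffill` is a hypothetical name, it works along the last axis only, and `n` is taken to mean the maximum number of consecutive NaNs to fill.

```python
import torch

def push_ffill(x: torch.Tensor, n=None) -> torch.Tensor:
    """Forward-fill NaNs along the last dim (hypothetical helper)."""
    length = x.shape[-1]
    idx = torch.arange(length, device=x.device).expand_as(x)
    valid = ~torch.isnan(x)
    # Own index where the value is valid, 0 elsewhere; the running max then
    # gives the index of the most recent valid value at every position
    last_valid = torch.where(valid, idx, torch.zeros_like(idx))
    last_valid = torch.cummax(last_valid, dim=-1).values
    # Leading NaNs point at index 0, which is itself NaN, so they stay NaN
    filled = torch.gather(x, -1, last_valid)
    if n is not None:
        # Do not fill further than n positions past the last valid value
        too_far = (idx - last_valid) > n
        filled = torch.where(too_far, torch.full_like(filled, float("nan")), filled)
    return filled

device = "cuda" if torch.cuda.is_available() else "cpu"
nan = float("nan")
series = torch.tensor([nan, 1.0, nan, nan, 4.0, nan], device=device)
filled_all = push_ffill(series)
filled_one = push_ffill(series, n=1)
```

Whether this beats bottleneck's CPU loop will depend on the array size and on host-to-device transfer costs, which the sketch does not measure.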



Seeking advice on implementing certain operators on GPU #446
