Improve memory access patterns for index operations. #4493
Conversation
Currently, index operation kernels work in "source/destination index-major order". (E.g., if the thread count equals the slice size, each thread will process slice #0 in lockstep, and then slice #1, and so on.) However, when the elements inside each "slice" are separated by large strides (e.g., selecting columns of a matrix), it is better to switch to "elementInSlice-major order". For example, each thread can process element #0 of every slice, and then element #1 of every slice, and so on.
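To make the two orderings concrete, here is a minimal sketch of an index_select-style copy, assuming hypothetical names and a simplified layout (numSlices, sliceSize, dstSliceStride, srcSliceStride, elementStride); this is illustrative only, not the kernel code changed by this PR:

```cuda
#include <cstdint>

// Minimal sketch of the two orderings for a simplified slice-copy kernel.
// All names and the layout model are assumptions made for illustration.
__global__ void copySelectedSlices(float* dst, const float* src,
                                   const int64_t* indices,
                                   int numSlices, int sliceSize,
                                   int dstSliceStride, int srcSliceStride,
                                   int elementStride, bool indexIsMajor) {
  int total = numSlices * sliceSize;
  for (int linearIndex = blockIdx.x * blockDim.x + threadIdx.x;
       linearIndex < total;
       linearIndex += gridDim.x * blockDim.x) {
    int dstSlice, elementInSlice;
    if (indexIsMajor) {
      // Index-major: consecutive threads handle consecutive elements of the
      // same slice; accesses coalesce when intra-slice elements are contiguous.
      dstSlice = linearIndex / sliceSize;
      elementInSlice = linearIndex % sliceSize;
    } else {
      // elementInSlice-major: consecutive threads handle the same element of
      // consecutive slices; accesses coalesce when slices are close together
      // but intra-slice elements are far apart (e.g., matrix columns).
      elementInSlice = linearIndex / numSlices;
      dstSlice = linearIndex % numSlices;
    }
    int64_t srcSlice = indices[dstSlice];
    dst[dstSlice * dstSliceStride + elementInSlice * elementStride] =
        src[srcSlice * srcSliceStride + elementInSlice * elementStride];
  }
}
```

In the second branch, the threads of a warp read the same element position from adjacent slices, which matches the column-selection case described above.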
I have measured execution times for various configurations, and found that performance sometimes improves a lot and almost never worsens. A somewhat realistic example of a big improvement is:
One (arguably pathological) case where performance degrades is:
It's a nice idea, but it doesn't seem to be a clear win to me. In general it's hard to decide what's the better option offline (without looking at the contents of the index). In your first example it just so happens that the index is strided, but exactly covers the first (only!) 500 slices, and caching is probably going to be a big win in this case. However, I'm a bit afraid that if the index stride is only slightly smaller than all of the intra-slice strides, but the indices are very far apart, we will see a sharp drop in perf. I'd be curious to see more detailed benchmarks, with less regular access patterns that are not so evenly spread over the tensor.
Thanks!
LARGE_INDEX(real, unsigned int, 2, 2, -2, true);
} else {
  LARGE_INDEX(real, unsigned int, 2, 2, -2, false);
}
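Reading this dispatch, the trailing boolean argument appears to select between the two orderings described in the PR summary. As a purely illustrative sketch (the helper name, signature, and heuristic below are hypothetical, not necessarily what this PR implements), a stride-based rule driving that choice might look like:

```cuda
// Hypothetical helper, shown only to illustrate the kind of decision involved;
// the actual rule used by this PR may differ. Prefer index-major order when
// elements within a slice sit close together in memory, and fall back to
// elementInSlice-major order when the intra-slice stride is large (e.g., when
// selecting columns of a row-major matrix).
static bool chooseIndexMajorOrder(unsigned int elementStride,
                                  unsigned int sliceStride) {
  return elementStride <= sliceStride;
}

// Hypothetical usage, mirroring the dispatch above:
//   if (chooseIndexMajorOrder(elemStride, sliceStride)) {
//     LARGE_INDEX(real, unsigned int, 2, 2, -2, true);
//   } else {
//     LARGE_INDEX(real, unsigned int, 2, 2, -2, false);
//   }
```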
}
else {
  elementInSlice = linearIndex / innerSize;
  srcIndex = linearIndex % innerSize;
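To see concretely how this branch changes which addresses neighbouring threads touch, here is a tiny host-side demonstration (illustrative only; the toy sizes are made up, and I am assuming the kernel receives whichever divisor is appropriate for the chosen ordering as innerSize):

```cuda
#include <cstdio>

// Toy demonstration (host code, compilable with nvcc or any C++ compiler):
// with 4 slices of 3 elements each, print which (slice, element) pair each
// consecutive linearIndex maps to under both orderings. Consecutive
// linearIndex values correspond to adjacent threads in a warp.
int main() {
  const int numSlices = 4, sliceSize = 3, total = numSlices * sliceSize;
  for (int linearIndex = 0; linearIndex < total; ++linearIndex) {
    // Index-major: divide by the slice size.
    int majSlice = linearIndex / sliceSize;
    int majElem  = linearIndex % sliceSize;
    // elementInSlice-major: divide by the number of slices.
    int minElem  = linearIndex / numSlices;
    int minSlice = linearIndex % numSlices;
    printf("linear=%2d  index-major -> (slice %d, elem %d)   "
           "elementInSlice-major -> (slice %d, elem %d)\n",
           linearIndex, majSlice, majElem, minSlice, minElem);
  }
  return 0;
}
```

Under the second mapping, adjacent linearIndex values walk across slices at a fixed element position, so when the inter-slice stride is small, a warp's accesses stay close together even if the intra-slice stride is large.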
Hi @apaszke, thanks for the review. You raise a good point. I have checked several hundred random(-ish) configurations, and found that the performance almost never becomes visibly worse, except for the weird case I mentioned above. I'm currently running more tests including bigger strides; I'll update the thread when I get the results.
Thanks! Looking forward to the benchmark results!
Hi, sorry for the delay, but I have run some experiments and here are the results: https://github.com/yongjik/pt_test/tree/master/results/index_select Here are some highlights. Among the 81869 configurations on which I ran index operations, kernel execution time changed by:
Moreover, all six BAD cases (execution time increased by up to 21%) belong to the weird case I mentioned above. A realistic example is on line 84354 of https://raw.githubusercontent.com/yongjik/pt_test/master/results/index_select/SCORES.txt (Sorry for the data dump. I wasn't planning to show it to anyone, so the format is rather ugly. You might want to download the file and open it in some editor.) Here, we have a source tensor of (512, 255) and a dest tensor of (512, 256), applying index operations on columns. Everything I can throw at it shows a 9-15% improvement. See the first link above for more examples. (Some large tensors show up to 13x improvement.)
That's great, thanks for doing such a detailed benchmark! Do you know why this single case is slower after this patch?
Well, in that example […]. However, I think one can reasonably argue that such a use of […]
@apaszke Hi, any more thoughts on this PR?
Hi @apaszke, just wondering if I'm supposed to do something, or if you've been too busy to decide on this...?
We should merge this in.
@pytorchbot test this please
@yongjik sorry I didn't have time to review this again. It looks good at a glance, so we can merge it.
No worries! I just wanted to be sure that you weren't waiting for me to do something, because then we'd be in deadlock forever... :)
Thanks a lot @yongjik!