Improve memory access patterns for index operations. #4493


Merged
merged 1 commit into pytorch:master from yongjik:0104-idxselect on Jan 23, 2018

Conversation

yongjik
Contributor

@yongjik yongjik commented Jan 5, 2018

Currently, index operation kernels work in "source/destination index-major
order". (E.g., if thread count equals slice size, each thread will process
slice #0 in lockstep, and then slice #1, and so on.)

However, when elements inside each "slice" are separated by large strides (e.g.,
selecting columns of a matrix), it is better to switch to "elementInSlice-major
order". For example, each thread can process element #0 of every slice, and
then element #1 of every slice, and so on.
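The two orderings can be sketched as a mapping from a linear thread index to a (slice, elementInSlice) pair. This is purely illustrative (the real kernel is CUDA code; the function names here are made up):

```python
# Illustrative sketch of the two thread orderings, not the actual kernel.

def index_major(linear_index, slice_size):
    # Consecutive threads walk through one slice before moving to the next:
    # thread 0..slice_size-1 cover slice #0, then slice #1, and so on.
    return linear_index // slice_size, linear_index % slice_size

def element_major(linear_index, num_slices):
    # Consecutive threads touch the same element position across all slices,
    # so adjacent threads hit addresses one slice apart instead of one
    # (possibly huge) intra-slice stride apart.
    return linear_index % num_slices, linear_index // num_slices

num_slices, slice_size = 3, 4
print([index_major(i, slice_size) for i in range(6)])
# -> [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1)]
print([element_major(i, num_slices) for i in range(6)])
# -> [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
```

When the intra-slice stride is large (e.g., column selection), the second mapping gives adjacent threads adjacent memory addresses, which is what GPU memory coalescing rewards.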

@yongjik
Contributor Author

yongjik commented Jan 5, 2018

I have measured execution times for various configurations and found that performance sometimes improves substantially and almost never worsens.

A somewhat realistic example of big improvement is:

import numpy as np
import torch

B = torch.cuda.FloatTensor(2000, 2000)
A = torch.cuda.FloatTensor(2000, 500)

idxs = (np.arange(500) * 7) % 500
idxs = torch.cuda.LongTensor(idxs)

B.index_add_(1, idxs, A)
# On GTX 1080, execution time decreases 88% (1.195 ms -> 0.138 ms).
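For reference, the semantics of that call can be emulated on CPU with plain NumPy (the helper name below is made up; the speedup itself is GPU-specific and is not reproduced by this sketch):

```python
import numpy as np

# CPU emulation of B.index_add_(1, idxs, A): column k of A is accumulated
# into column idxs[k] of B. Semantics only; the timing above was on a GPU.
def index_add_dim1(B, idxs, A):
    for k, j in enumerate(idxs):
        B[:, j] += A[:, k]

B = np.zeros((4, 6))
A = np.ones((4, 3))
idxs = (np.arange(3) * 5) % 6   # strided column indices: [0, 5, 4]
index_add_dim1(B, idxs, A)
print(B[0])   # columns 0, 4, 5 each received one unit
```

Note that each selected column of B lives at a large stride in memory, which is exactly the access pattern the patch targets.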

One (arguably pathological) case where performance degrades is:

B = torch.cuda.FloatTensor(2000, 4)
idxs = torch.cuda.LongTensor(np.zeros(40))
B.index_fill_(1, idxs, 1.0)
# On GTX 1080, execution time increases 22% (13.45 us -> 16.41 us).

Contributor

@apaszke apaszke left a comment


It's a nice idea, but it doesn't seem like a clear win to me. In general it's hard to decide offline (without looking at the contents of the index) which option is better. In your first example it just so happens that the index is strided but exactly covers the first (only!) 500 slices, and caching is probably going to be a big win in this case. However, I'm a bit afraid that if the index stride is only slightly smaller than all of the intra-slice strides, but the indices are very far apart, we will see a sharp drop in perf. I'd be curious to see more detailed benchmarks, with less regular access patterns that are not so evenly spread over the tensor.

Thanks!

    LARGE_INDEX(real, unsigned int, 2, 2, -2, true);
  } else {
    LARGE_INDEX(real, unsigned int, 2, 2, -2, false);
  }


  }
  else {
    elementInSlice = linearIndex / innerSize;
    srcIndex = linearIndex % innerSize;


@yongjik
Contributor Author

yongjik commented Jan 5, 2018

Hi @apaszke, thanks for the review. You raise a good point.

I have checked several hundred random(-ish) configurations and found that performance almost never gets visibly worse, except for the weird case I mentioned above. I'm currently running more tests, including bigger strides; I'll update the thread when I have the results.

@apaszke
Contributor

apaszke commented Jan 5, 2018

Thanks! Looking forward to benchmark results!

@yongjik
Contributor Author

yongjik commented Jan 11, 2018

Hi, sorry for the delay, but I have run some experiments and here's the result: https://github.com/yongjik/pt_test/tree/master/results/index_select

Here are some highlights.

Across the 81,869 configurations on which I ran index operations, kernel execution time changed as follows:

  =========================
  BAD   > 20%        :     6
  Bad   10 .. 20%    :    15
  bad   5 .. 10%     :    34
        -5 .. 5%     : 79785
  good  -10 .. -5%   :   744
  Good  -20 .. -10%  :   482
  GOOD  -50 .. -20%  :   472
  BEST  < -50%       :   331

Moreover, all six BAD cases (execution time increased by up to 21%) belong to the weird index_fill_ I mentioned in my first comment above. It's possible that I'm missing some cases: after all, I created all these tests by going through permutations, so there's a lot of "variations of the same theme". However, the trend seems quite positive to me.
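For reference, a hypothetical sketch of the bucketing behind the table above, with the labels and edges inferred from the table itself (this is not the author's actual benchmark script):

```python
# Bucket a per-configuration change in kernel time, expressed in percent
# (positive = slower after the patch, negative = faster).
def bucket(pct_change):
    edges = [(-50, "BEST"), (-20, "GOOD"), (-10, "Good"), (-5, "good"),
             (5, "neutral"), (10, "bad"), (20, "Bad")]
    for edge, label in edges:
        if pct_change < edge:
            return label
    return "BAD"

print(bucket(-60))  # BEST: more than 50% faster
print(bucket(0))    # neutral: within the -5..5% noise band
print(bucket(25))   # BAD: more than 20% slower
```

Boundary handling at the exact edge values is a guess; the table does not specify whether ranges are open or closed.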

A realistic example is in line 84354 of https://raw.githubusercontent.com/yongjik/pt_test/master/results/index_select/SCORES.txt

(Sorry for the data dump. I wasn't planning to show it to anyone, so the format is rather ugly. You might want to download the file and open it in some editor.)

Here, we have a source tensor of shape (512, 255) and a destination tensor of shape (512, 256), applying index operations on columns. Everything I can throw at it shows a 9-15% improvement. See the first link above for more examples. (Some large tensors show up to a 13x improvement.)

@apaszke
Contributor

apaszke commented Jan 11, 2018

That's great, thanks for doing such a detailed benchmark! Do you know why this single case is slower after the patch?

@yongjik
Contributor Author

yongjik commented Jan 11, 2018

Well, in that example index_fill_() is trying to write to the same column 40 times, so of course it will lead to write conflicts. In the original code, at least these 40 writes are spread far apart (probably in different thread blocks), but in the new code they are executed by 40 consecutive kernel threads.

However, I think one can reasonably argue that such a use of index_fill_ is "broken" performance-wise, and that we cannot expect the code to be fast anyway.
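To make the redundancy concrete, here is a CPU sketch of that degenerate case (the helper name is made up; it emulates the semantics and counts element writes, not GPU timing):

```python
import numpy as np

# Emulate B.index_fill_(1, idxs, value): every entry of idxs selects a whole
# column to overwrite, so repeated indices mean repeated writes to the same
# memory. In the new ordering those repeats land on consecutive threads.
def index_fill_dim1(B, idxs, value):
    writes = 0
    for j in idxs:
        B[:, j] = value
        writes += B.shape[0]
    return writes

B = np.zeros((2000, 4))
idxs = np.zeros(40, dtype=np.int64)      # the same column, 40 times
writes = index_fill_dim1(B, idxs, 1.0)
print(writes)   # 80000 element writes to fill 2000 useful elements
```

Forty-fold redundant writes to one column are wasted work under any thread ordering, which supports the argument that this usage is performance-broken to begin with.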

@yongjik
Contributor Author

yongjik commented Jan 16, 2018

@apaszke Hi, any more thoughts on this PR?

@yongjik
Contributor Author

yongjik commented Jan 22, 2018

Hi @apaszke,

Just wondering if I'm supposed to do something, or if you've been too busy to decide on this...?

@soumith
Member

soumith commented Jan 22, 2018

we should merge this in.

@soumith
Member

soumith commented Jan 22, 2018

@pytorchbot test this please

@apaszke
Contributor

apaszke commented Jan 22, 2018

@yongjik sorry, I didn't have time to review this again. It looks good at a glance, so we can merge it.

@yongjik
Contributor Author

yongjik commented Jan 22, 2018

No worries! I just wanted to be sure that you weren't waiting for me to do something, because then we'd be in deadlock forever... :)

@soumith soumith merged commit 966db35 into pytorch:master Jan 23, 2018
@soumith
Member

soumith commented Jan 23, 2018

thanks a lot @yongjik !

@yongjik yongjik deleted the 0104-idxselect branch February 5, 2018 18:04
@soumith soumith added the 0.3.1 label Feb 5, 2018
soumith pushed a commit that referenced this pull request Feb 7, 2018