Improve memory access patterns for index operations. #4493


Merged
merged 1 commit into pytorch:master from yongjik:0104-idxselect on Jan 23, 2018

Conversation

yongjik
Contributor

@yongjik yongjik commented Jan 5, 2018

Currently, index operation kernels work in "source/destination index-major
order". (E.g., if thread count equals slice size, each thread will process
slice #0 in lockstep, and then slice #1, and so on.)

However, when elements inside each "slice" are separated by large strides (e.g.,
selecting columns of a matrix), it is better to switch to "elementInSlice-major
order". For example, each thread can process element #0 of every slice, and
then element #1 of every slice, and so on.
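The two orderings can be sketched as a mapping from a linear thread index to a (slice, elementInSlice) pair. This is purely illustrative (the real kernel is CUDA code; the function names here are made up):

```python
# Illustrative sketch of the two thread orderings, not the actual kernel.

def index_major(linear_index, slice_size):
    # Consecutive threads walk through one slice before moving to the next:
    # thread 0..slice_size-1 cover slice #0, then slice #1, and so on.
    return linear_index // slice_size, linear_index % slice_size

def element_major(linear_index, num_slices):
    # Consecutive threads touch the same element position across all slices,
    # so adjacent threads hit addresses one slice apart instead of one
    # (possibly huge) intra-slice stride apart.
    return linear_index % num_slices, linear_index // num_slices

num_slices, slice_size = 3, 4
print([index_major(i, slice_size) for i in range(6)])
# -> [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1)]
print([element_major(i, num_slices) for i in range(6)])
# -> [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
```

When the intra-slice stride is large (e.g., column selection), the second mapping gives adjacent threads adjacent memory addresses, which is what GPU memory coalescing rewards.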

@yongjik
Contributor Author

yongjik commented Jan 5, 2018

I have measured execution times for various configurations and found that performance sometimes improves substantially and almost never worsens.

A somewhat realistic example of big improvement is:

import numpy as np
import torch

B = torch.cuda.FloatTensor(2000, 2000)
A = torch.cuda.FloatTensor(2000, 500)

idxs = (np.arange(500) * 7) % 500
idxs = torch.cuda.LongTensor(idxs)

B.index_add_(1, idxs, A)
# On GTX 1080, execution time decreases 88% (1.195 ms -> 0.138 ms).
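For reference, the semantics of that call can be emulated on CPU with plain NumPy (the helper name below is made up; the speedup itself is GPU-specific and is not reproduced by this sketch):

```python
import numpy as np

# CPU emulation of B.index_add_(1, idxs, A): column k of A is accumulated
# into column idxs[k] of B. Semantics only; the timing above was on a GPU.
def index_add_dim1(B, idxs, A):
    for k, j in enumerate(idxs):
        B[:, j] += A[:, k]

B = np.zeros((4, 6))
A = np.ones((4, 3))
idxs = (np.arange(3) * 5) % 6   # strided column indices: [0, 5, 4]
index_add_dim1(B, idxs, A)
print(B[0])   # columns 0, 4, 5 each received one unit
```

Note that each selected column of B lives at a large stride in memory, which is exactly the access pattern the patch targets.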

One (arguably pathological) case where performance degrades is:

B = torch.cuda.FloatTensor(2000, 4)
idxs = torch.cuda.LongTensor(np.zeros(40))
B.index_fill_(1, idxs, 1.0)
# On GTX 1080, execution time increases 22% (13.45 us -> 16.41 us).

Contributor

@apaszke apaszke left a comment


It's a nice idea, but it doesn't seem like a clear win to me. In general it's hard to decide offline (without looking at the contents of the index) which option is better. In your first example it just so happens that the index is strided but exactly covers the first (only!) 500 slices, and caching is probably going to be a big win in this case. However, I'm a bit afraid that if the index stride is only slightly smaller than all of the intra-slice strides, but the indices are very far apart, we will see a sharp drop in perf. I'd be curious to see more detailed benchmarks, with less regular access patterns that are not so evenly spread over the tensor.

Thanks!

    LARGE_INDEX(real, unsigned int, 2, 2, -2, true);
  } else {
    LARGE_INDEX(real, unsigned int, 2, 2, -2, false);
  }


  }
  else {
    elementInSlice = linearIndex / innerSize;
    srcIndex = linearIndex % innerSize;


@yongjik
Contributor Author

yongjik commented Jan 5, 2018

Hi @apaszke, thanks for the review. You raise a good point.

I have checked several hundred random(-ish) configurations and found that performance almost never gets visibly worse, except for the weird case I mentioned above. I'm currently running more tests, including bigger strides; I'll update the thread when I have the results.

@apaszke
Contributor

apaszke commented Jan 5, 2018

Thanks! Looking forward to benchmark results!

@yongjik
Contributor Author

yongjik commented Jan 11, 2018

Hi, sorry for the delay, but I have run some experiments and here's the result: https://github.com/yongjik/pt_test/tree/master/results/index_select

Here are some highlights.

Across the 81,869 configurations on which I ran index operations, kernel execution time changed as follows:

  =========================
  BAD   > 20%        :     6
  Bad   10 .. 20%    :    15
  bad   5 .. 10%     :    34
        -5 .. 5%     : 79785
  good  -10 .. -5%   :   744
  Good  -20 .. -10%  :   482
  GOOD  -50 .. -20%  :   472
  BEST  < -50%       :   331

Moreover, all six BAD cases (execution time increased by up to 21%) belong to the weird index_fill_ I mentioned in my first comment above. It's possible that I'm missing some cases: after all, I created all these tests by going through permutations, so there's a lot of "variations of the same theme". However, the trend seems quite positive to me.
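For reference, a hypothetical sketch of the bucketing behind the table above, with the labels and edges inferred from the table itself (this is not the author's actual benchmark script):

```python
# Bucket a per-configuration change in kernel time, expressed in percent
# (positive = slower after the patch, negative = faster).
def bucket(pct_change):
    edges = [(-50, "BEST"), (-20, "GOOD"), (-10, "Good"), (-5, "good"),
             (5, "neutral"), (10, "bad"), (20, "Bad")]
    for edge, label in edges:
        if pct_change < edge:
            return label
    return "BAD"

print(bucket(-60))  # BEST: more than 50% faster
print(bucket(0))    # neutral: within the -5..5% noise band
print(bucket(25))   # BAD: more than 20% slower
```

Boundary handling at the exact edge values is a guess; the table does not specify whether ranges are open or closed.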

A realistic example is in line 84354 of https://raw.githubusercontent.com/yongjik/pt_test/master/results/index_select/SCORES.txt

(Sorry for the data dump. I wasn't planning to show it to anyone, so the format is rather ugly. You might want to download the file and open it in some editor.)

Here, we have a source tensor of shape (512, 255) and a destination tensor of shape (512, 256), applying index operations on columns. Everything I can throw at it shows a 9-15% improvement. See the first link above for more examples. (Some large tensors show up to a 13x improvement.)

@apaszke
Contributor

apaszke commented Jan 11, 2018

That's great, thanks for doing such a detailed benchmark! Do you know why this single case is slower after the patch?

@yongjik
Contributor Author

yongjik commented Jan 11, 2018

Well, in that example index_fill_() is trying to write to the same column 40 times, so of course it will lead to write conflicts. In the original code, at least these 40 writes are spread far apart (probably in different thread blocks), but in the new code they are executed by 40 consecutive kernel threads.

However, I think one can reasonably argue that such a use of index_fill_ is "broken" performance-wise, and that we cannot expect the code to be fast anyway.
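To make the redundancy concrete, here is a CPU sketch of that degenerate case (the helper name is made up; it emulates the semantics and counts element writes, not GPU timing):

```python
import numpy as np

# Emulate B.index_fill_(1, idxs, value): every entry of idxs selects a whole
# column to overwrite, so repeated indices mean repeated writes to the same
# memory. In the new ordering those repeats land on consecutive threads.
def index_fill_dim1(B, idxs, value):
    writes = 0
    for j in idxs:
        B[:, j] = value
        writes += B.shape[0]
    return writes

B = np.zeros((2000, 4))
idxs = np.zeros(40, dtype=np.int64)      # the same column, 40 times
writes = index_fill_dim1(B, idxs, 1.0)
print(writes)   # 80000 element writes to fill 2000 useful elements
```

Forty-fold redundant writes to one column are wasted work under any thread ordering, which supports the argument that this usage is performance-broken to begin with.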

@yongjik
Contributor Author

yongjik commented Jan 16, 2018

@apaszke Hi, any more thoughts on this PR?

@yongjik
Contributor Author

yongjik commented Jan 22, 2018

Hi @apaszke,

Just wondering if I'm supposed to do something, or if you've been too busy to decide on this...?

@soumith
Member

soumith commented Jan 22, 2018

we should merge this in.

@soumith
Member

soumith commented Jan 22, 2018

@pytorchbot test this please

@apaszke
Contributor

apaszke commented Jan 22, 2018

@yongjik sorry, I didn't have time to review this again. It looks good at a glance, so we can merge it.

@yongjik
Contributor Author

yongjik commented Jan 22, 2018

No worries! I just wanted to be sure that you weren't waiting for me to do something, because then we'd be in deadlock forever... :)

@soumith soumith merged commit 966db35 into pytorch:master Jan 23, 2018
@soumith
Member

soumith commented Jan 23, 2018

thanks a lot @yongjik !

@yongjik yongjik deleted the 0104-idxselect branch February 5, 2018 18:04
@soumith soumith added the 0.3.1 label Feb 5, 2018
soumith pushed a commit that referenced this pull request Feb 7, 2018