Improve compare kernel #29743
Conversation
The PR looks good. You are right, the type promotion logic is becoming very complicated, so it would be good to have a benchmarking script (`/benchmarks` looks like a natural place).
Will merge conflict with #30065
@VitalyFedyunin I just saw #30065. I suggest landing #30065 first, because I feel it might be easier to fix the merge conflicts here. (The merge conflict would tell me about …)
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Currently, the way the compare kernels handle dtypes is very funny (this behavior was introduced in #28427 and I only realized it today):

Let's say `a`, `b` are two float tensors on CUDA. If you do `a < b`, this is what would happen inside the loop:

- load `a` and `b`, dynamically cast them from `float` to `float` (i.e. check the scalar type to figure out if it needs a cast; it doesn't, so do nothing)
- compute `a < b`, get a `bool` result
- convert the `bool` result to `float`, then dynamically cast it from `float` to `bool` and store the value

And if you do `a.lt_(b)`, this is what would happen:

- load `a` and `b`, no casting
- compute `a < b`, get a `bool` result
- convert the result to `float` and store the value
Although the dynamic casting happens in registers, it still hurts performance a bit (~8%).

This PR fixes the issue. Now, for compare kernels, if the output is `bool` and the inputs have the same dtype, there is no dynamic casting; otherwise, there is dynamic casting for each input and output. That is, the dynamic-casting behavior of the two cases described above is swapped.
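The new rule can be sketched as a small predicate (the helper name is hypothetical, not the actual TensorIterator API):

```python
# Hypothetical helper sketching the rule this PR introduces for compare
# kernels: skip dynamic casting exactly when the output is bool and all
# inputs share one dtype; otherwise dynamically cast every input and output.

def needs_dynamic_casting(input_dtypes, output_dtype):
    """Return True if the compare kernel must dynamically cast operands."""
    same_input_dtype = len(set(input_dtypes)) == 1
    return not (output_dtype == "bool" and same_input_dtype)

# Fast path: float < float -> bool no longer does any dynamic casting.
print(needs_dynamic_casting(["float", "float"], "bool"))   # False
# a.lt_(b) writes a float output, so it now takes the casting path.
print(needs_dynamic_casting(["float", "float"], "float"))  # True
```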
Benchmark on `a < b` for a tensor of 1000000000 fp32 elements:

- Before #28427: 6.35 ms
- Current master: 6.88 ms
- With this PR: 6.36 ms
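For reference, the relative regression and recovery implied by these timings (computed from the numbers quoted above):

```python
# Relative overhead of the dynamic-casting path, derived from the
# benchmark timings quoted above (all in ms).
before_28427, master, this_pr = 6.35, 6.88, 6.36

regression = (master - before_28427) / before_28427 * 100
recovered = (master - this_pr) / master * 100
print(f"master vs before #28427: +{regression:.1f}%")  # the ~8% mentioned above
print(f"this PR vs master:       -{recovered:.1f}%")
```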
Benchmark on `a.lt_(b)` does not show any difference across versions.

Besides this, what worries me most is that, with type promotion, the logic of the tensor iterator is becoming super complicated, and it is hard to see whether one change causes a performance regression elsewhere. I suggest we create a script that benchmarks the tensor iterator as a whole, review that code, and put it somewhere inside the repository (maybe under `/tools` or `/test/scripts`?), so that whenever we are not certain about performance we can run it to check. (I guess not on this PR, but on PRs after the script is done: if there are worries about performance, the PR author should run the script manually, and the reviewer should remind the author to do so if necessary.) If this is a good idea, I will send a PR for the script.
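A minimal, stdlib-only sketch of what such a benchmarking script could look like (the structure and names are my assumption, not an agreed design); a real version would register TensorIterator-backed workloads such as `a < b` across dtype combinations instead of the toy workload below:

```python
# Sketch of a tiny benchmark harness (hypothetical design, stdlib only).
# A real script under /tools or /test/scripts would register tensor-iterator
# workloads (e.g. lambda: a < b for various dtype combinations) instead of
# the toy list comprehension used here.
import timeit

def bench(name, fn, number=1000, repeat=5):
    """Run fn `number` times per trial; return the best trial's per-call ms."""
    best = min(timeit.repeat(fn, number=number, repeat=repeat))
    per_call_ms = best / number * 1e3
    print(f"{name:>20}: {per_call_ms:.4f} ms/call")
    return per_call_ms

# Toy stand-in workload; swap in real tensor ops when running against torch.
xs = list(range(1000))
t = bench("toy compare", lambda: [x < 500 for x in xs])
```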