Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix copy kernel speed regression introduced in #29631 #31279

Closed
wants to merge 2 commits into from

Conversation

zasdfgbnm
Copy link
Collaborator

@zasdfgbnm zasdfgbnm commented Dec 14, 2019

Fixes #31271

This fixes copy kernel speed regression introduced in #29631.

The previous implementation forces the compiler to instantiate static_cast_with_inter_type because it is passed as an argument of a function. This behavior makes it impossible for compilers to do optimizations like automatic vectorization, and, function call itself is expensive compared to a single casting instruction.

To check the change, run

readelf -Ws /home/xgao/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so | grep static_cast_with_inter_type

On nightly build, we have output

168217: 0000000001852bf0     5 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIsdE5applyEd
168816: 0000000001852d30    33 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeISt7complexIfEaE5applyEa
168843: 00000000018531f0     7 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIblE5applyEl
168930: 0000000001852c20     3 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIslE5applyEl
168935: 00000000018528d0   124 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIfNS_4HalfEE5applyES1_
169023: 0000000001852f30    17 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeISt7complexIdEhE5applyEh
169713: 00000000018525c0     3 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIahE5applyEh
170033: 0000000001852c10     3 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIsiE5applyEi
170105: 0000000001852bd0     5 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIshE5applyEh
170980: 0000000001852fc0    27 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeISt7complexIdES1_IfEE5applyES3_
171398: 0000000001852810    13 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIdbE5applyEb
171574: 00000000018532e0    35 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIbNS_8BFloat16EE5applyES1_
171734: 0000000001852b20     6 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIlSt7complexIdEE5applyES2_
172422: 0000000001853350    54 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeINS_8BFloat16EaE5applyEa
172704: 00000000018533c0    38 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeINS_8BFloat16EfE5applyEf
172976: 0000000001852890    10 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIflE5applyEl
173038: 0000000001852f80     9 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeISt7complexIdEfE5applyEf
173329: 00000000018531c0    20 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIbfE5applyEf
173779: 00000000018524d0     3 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIhiE5applyEi
174032: 0000000001852960    14 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIfNS_8BFloat16EE5applyES1_
174334: 0000000001852d60    29 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeISt7complexIfEdE5applyEd
174470: 0000000001852c60   124 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIsNS_4HalfEE5applyES1_
174770: 0000000001852bc0    15 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIlNS_8BFloat16EE5applyES1_
176408: 0000000001853980   144 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeINS_4HalfEbE5applyEb
176475: 0000000001852790   128 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIdNS_4HalfEE5applyES1_
....

And after this PR, we get empty output

Copy link
Collaborator

@ngimel ngimel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Thanks!

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@kostmo
Copy link
Member

kostmo commented Dec 15, 2019

CircleCI build failures summary

As of commit f734bbe:

  • 0/0 failures introduced in this PR

This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker.

@zasdfgbnm zasdfgbnm deleted the fix-copykernel branch December 16, 2019 18:29
zasdfgbnm added a commit to zasdfgbnm/pytorch that referenced this pull request Dec 16, 2019
…#31279)

Summary:
Fixes pytorch#31271

This fixes copy kernel speed regression introduced in pytorch#29631.

The previous implementation forces the compiler to instantiate `static_cast_with_inter_type` because it is passed as an argument of a function. This behavior makes it impossible for compilers to do optimizations like automatic vectorization, and, function call itself is expensive compared to a single casting instruction.

To check the change, run
```
readelf -Ws /home/xgao/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so | grep static_cast_with_inter_type
```

On nightly build, we have output
```
168217: 0000000001852bf0     5 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIsdE5applyEd
168816: 0000000001852d30    33 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeISt7complexIfEaE5applyEa
168843: 00000000018531f0     7 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIblE5applyEl
168930: 0000000001852c20     3 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIslE5applyEl
168935: 00000000018528d0   124 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIfNS_4HalfEE5applyES1_
169023: 0000000001852f30    17 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeISt7complexIdEhE5applyEh
169713: 00000000018525c0     3 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIahE5applyEh
170033: 0000000001852c10     3 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIsiE5applyEi
170105: 0000000001852bd0     5 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIshE5applyEh
170980: 0000000001852fc0    27 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeISt7complexIdES1_IfEE5applyES3_
171398: 0000000001852810    13 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIdbE5applyEb
171574: 00000000018532e0    35 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIbNS_8BFloat16EE5applyES1_
171734: 0000000001852b20     6 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIlSt7complexIdEE5applyES2_
172422: 0000000001853350    54 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeINS_8BFloat16EaE5applyEa
172704: 00000000018533c0    38 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeINS_8BFloat16EfE5applyEf
172976: 0000000001852890    10 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIflE5applyEl
173038: 0000000001852f80     9 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeISt7complexIdEfE5applyEf
173329: 00000000018531c0    20 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIbfE5applyEf
173779: 00000000018524d0     3 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIhiE5applyEi
174032: 0000000001852960    14 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIfNS_8BFloat16EE5applyES1_
174334: 0000000001852d60    29 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeISt7complexIfEdE5applyEd
174470: 0000000001852c60   124 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIsNS_4HalfEE5applyES1_
174770: 0000000001852bc0    15 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIlNS_8BFloat16EE5applyES1_
176408: 0000000001853980   144 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeINS_4HalfEbE5applyEb
176475: 0000000001852790   128 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIdNS_4HalfEE5applyES1_
....
```

And after this PR, we get empty output
```
```
Pull Request resolved: pytorch#31279

Differential Revision: D19075587

Pulled By: ngimel

fbshipit-source-id: c20088241f39fa40c1d055f0a46eb5b9ece52e71
@facebook-github-bot
Copy link
Contributor

@ngimel merged this pull request in 60ec53c.

mingbowan pushed a commit that referenced this pull request Dec 20, 2019
Summary:
Fixes #31271

This fixes copy kernel speed regression introduced in #29631.

The previous implementation forces the compiler to instantiate `static_cast_with_inter_type` because it is passed as an argument of a function. This behavior makes it impossible for compilers to do optimizations like automatic vectorization, and, function call itself is expensive compared to a single casting instruction.

To check the change, run
```
readelf -Ws /home/xgao/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so | grep static_cast_with_inter_type
```

On nightly build, we have output
```
168217: 0000000001852bf0     5 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIsdE5applyEd
168816: 0000000001852d30    33 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeISt7complexIfEaE5applyEa
168843: 00000000018531f0     7 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIblE5applyEl
168930: 0000000001852c20     3 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIslE5applyEl
168935: 00000000018528d0   124 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIfNS_4HalfEE5applyES1_
169023: 0000000001852f30    17 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeISt7complexIdEhE5applyEh
169713: 00000000018525c0     3 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIahE5applyEh
170033: 0000000001852c10     3 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIsiE5applyEi
170105: 0000000001852bd0     5 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIshE5applyEh
170980: 0000000001852fc0    27 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeISt7complexIdES1_IfEE5applyES3_
171398: 0000000001852810    13 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIdbE5applyEb
171574: 00000000018532e0    35 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIbNS_8BFloat16EE5applyES1_
171734: 0000000001852b20     6 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIlSt7complexIdEE5applyES2_
172422: 0000000001853350    54 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeINS_8BFloat16EaE5applyEa
172704: 00000000018533c0    38 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeINS_8BFloat16EfE5applyEf
172976: 0000000001852890    10 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIflE5applyEl
173038: 0000000001852f80     9 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeISt7complexIdEfE5applyEf
173329: 00000000018531c0    20 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIbfE5applyEf
173779: 00000000018524d0     3 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIhiE5applyEi
174032: 0000000001852960    14 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIfNS_8BFloat16EE5applyES1_
174334: 0000000001852d60    29 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeISt7complexIfEdE5applyEd
174470: 0000000001852c60   124 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIsNS_4HalfEE5applyES1_
174770: 0000000001852bc0    15 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIlNS_8BFloat16EE5applyES1_
176408: 0000000001853980   144 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeINS_4HalfEbE5applyEb
176475: 0000000001852790   128 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIdNS_4HalfEE5applyES1_
....
```

And after this PR, we get empty output
```
```
Pull Request resolved: #31279

Differential Revision: D19075587

Pulled By: ngimel

fbshipit-source-id: c20088241f39fa40c1d055f0a46eb5b9ece52e71
wuhuikx pushed a commit to wuhuikx/pytorch that referenced this pull request Jan 30, 2020
…#31279)

Summary:
Fixes pytorch#31271

This fixes copy kernel speed regression introduced in pytorch#29631.

The previous implementation forces the compiler to instantiate `static_cast_with_inter_type` because it is passed as an argument of a function. This behavior makes it impossible for compilers to do optimizations like automatic vectorization, and, function call itself is expensive compared to a single casting instruction.

To check the change, run
```
readelf -Ws /home/xgao/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so | grep static_cast_with_inter_type
```

On nightly build, we have output
```
168217: 0000000001852bf0     5 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIsdE5applyEd
168816: 0000000001852d30    33 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeISt7complexIfEaE5applyEa
168843: 00000000018531f0     7 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIblE5applyEl
168930: 0000000001852c20     3 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIslE5applyEl
168935: 00000000018528d0   124 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIfNS_4HalfEE5applyES1_
169023: 0000000001852f30    17 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeISt7complexIdEhE5applyEh
169713: 00000000018525c0     3 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIahE5applyEh
170033: 0000000001852c10     3 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIsiE5applyEi
170105: 0000000001852bd0     5 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIshE5applyEh
170980: 0000000001852fc0    27 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeISt7complexIdES1_IfEE5applyES3_
171398: 0000000001852810    13 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIdbE5applyEb
171574: 00000000018532e0    35 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIbNS_8BFloat16EE5applyES1_
171734: 0000000001852b20     6 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIlSt7complexIdEE5applyES2_
172422: 0000000001853350    54 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeINS_8BFloat16EaE5applyEa
172704: 00000000018533c0    38 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeINS_8BFloat16EfE5applyEf
172976: 0000000001852890    10 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIflE5applyEl
173038: 0000000001852f80     9 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeISt7complexIdEfE5applyEf
173329: 00000000018531c0    20 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIbfE5applyEf
173779: 00000000018524d0     3 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIhiE5applyEi
174032: 0000000001852960    14 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIfNS_8BFloat16EE5applyES1_
174334: 0000000001852d60    29 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeISt7complexIfEdE5applyEd
174470: 0000000001852c60   124 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIsNS_4HalfEE5applyES1_
174770: 0000000001852bc0    15 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIlNS_8BFloat16EE5applyES1_
176408: 0000000001853980   144 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeINS_4HalfEbE5applyEb
176475: 0000000001852790   128 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIdNS_4HalfEE5applyES1_
....
```

And after this PR, we get empty output
```
```
Pull Request resolved: pytorch#31279

Differential Revision: D19075587

Pulled By: ngimel

fbshipit-source-id: c20088241f39fa40c1d055f0a46eb5b9ece52e71
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Performance regression in CPU copy kernel since 1.3.1
5 participants