Reduce overhead when Future invokes callbacks inline #57638

Closed
wants to merge 9 commits

Conversation

@lw (Contributor) commented May 5, 2021

Stack from ghstack:

In RPC there are a few instances of "fastpaths" which do `if (fut.isCompleted()) { do_sth(); } else { fut.addCallback(do_sth); }`. I intend to get rid of them, for reasons I'll clarify later but which in a nutshell have to do with CUDA correctness and readability. Note that dropping the fastpath introduces no change in behavior (because `addCallback` invokes the callback inline anyways), thus the only perf concern comes from the fact that the fastpath avoids constructing and passing around a `std::function`. I don't think this is a significant performance hit. Regardless, this PR preemptively addresses this concern, by tweaking `addCallback` (and similar methods) so they can handle raw lambdas, and so that they do _not_ wrap them into `std::function`s if they are invoked inline. In other words, if the compiler were to inline this new version of `addCallback` it would obtain the _exact_ same code as that explicit fastpath.

Differential Revision: [D28222808](https://our.internmc.facebook.com/intern/diff/D28222808/)
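
For context, here is a minimal sketch of the mechanism being described — a hypothetical `SimpleFuture`, not the actual `c10::ivalue::Future` code, and it assumes C++17 for `std::is_invocable_v`. The templated `addCallback` invokes a raw callable directly when the future is already completed, and only type-erases it into a `std::function` when it has to be stored for later:

```cpp
#include <functional>
#include <mutex>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical, heavily simplified future; the real c10::ivalue::Future is
// more involved (value and error handling are omitted here).
class SimpleFuture {
 public:
  // Accepts any callable, not just std::function.
  template <typename T>
  void addCallback(T callback) {
    static_assert(
        std::is_invocable_v<T>,
        "The callback must be callable with no arguments (sketch only)");
    std::unique_lock<std::mutex> lock(mutex_);
    if (completed_) {
      lock.unlock();
      // Inline path: the raw callable is invoked directly, so no
      // std::function is ever constructed.
      callback();
      return;
    }
    // Deferred path: only here is the callable type-erased so that it can be
    // stored until markCompleted() runs it.
    callbacks_.emplace_back(std::move(callback));
  }

  void markCompleted() {
    std::unique_lock<std::mutex> lock(mutex_);
    completed_ = true;
    auto cbs = std::move(callbacks_);
    lock.unlock();
    for (auto& cb : cbs) {
      cb();
    }
  }

 private:
  std::mutex mutex_;
  bool completed_ = false;
  std::vector<std::function<void()>> callbacks_;
};
```

With this shape, calling `fut.addCallback([&] { do_sth(); })` on an already-completed future boils down to a direct call of the lambda once the compiler inlines `addCallback`, which is the equivalence with the explicit fastpath claimed above.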
@facebook-github-bot (Contributor) commented May 5, 2021

💊 CI failures summary and remediations

As of commit 4bab8db (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (1/1)

Step: "Build"

May 21 12:18:12 torch_xla/csrc/aten_xla_type.cp... match any declaration in 'torch_xla::AtenXlaType'
May 21 12:18:12 torch_xla/csrc/aten_xla_type.cpp:1238:25: error: out-of-line definition of 'div' does not match any declaration in 'torch_xla::AtenXlaType'
May 21 12:18:12 at::Tensor AtenXlaType::div(const at::Tensor& self, const at::Tensor& other,
May 21 12:18:12                         ^~~
May 21 12:18:12 /var/lib/jenkins/workspace/xla/torch_xla/csrc/aten_xla_type.h:35:74: note: type of 3rd parameter of member declaration does not match definition ('optional<c10::string_view>' vs 'optional<std::string>')
May 21 12:18:12 static at::Tensor div(const at::Tensor & self, const at::Tensor & other, c10::optional<c10::string_view> rounding_mode);
May 21 12:18:12                                                                          ^
May 21 12:18:12 torch_xla/csrc/aten_xla_type.cpp:1257:26: error: out-of-line definition of 'div_' does not match any declaration in 'torch_xla::AtenXlaType'
May 21 12:18:12 at::Tensor& AtenXlaType::div_(at::Tensor& self, const at::Tensor& other,
May 21 12:18:12                          ^~~~
May 21 12:18:12 /var/lib/jenkins/workspace/xla/torch_xla/csrc/aten_xla_type.h:84:71: note: type of 3rd parameter of member declaration does not match definition ('optional<c10::string_view>' vs 'optional<std::string>')
May 21 12:18:12 static at::Tensor & div_(at::Tensor & self, const at::Tensor & other, c10::optional<c10::string_view> rounding_mode);
May 21 12:18:12                                                                       ^
May 21 12:18:15 2 errors generated.

XLA failure

Job pytorch_xla_linux_bionic_py3_6_clang9_build is failing. Please create an issue with title prefixed by [PT_BREAK] in pytorch/xla and link to this PR. If you have questions, please reach out to @ailzhang / @dlibenzi / @JackCaoG.


This comment was automatically generated by Dr. CI.

lw added a commit that referenced this pull request May 5, 2021
ghstack-source-id: 128188696
Pull Request resolved: #57638
Differential Revision: [D28222808](https://our.internmc.facebook.com/intern/diff/D28222808/)
lw added a commit that referenced this pull request May 6, 2021
Pull Request resolved: #57638
ghstack-source-id: 128297741
Differential Revision: [D28222808](https://our.internmc.facebook.com/intern/diff/D28222808/)
lw added 2 commits May 17, 2021 03:53
@mrshenli (Contributor) left a comment

(can be done in followup PRs): shall we add some test to verify that the new code throws correct error when the signature is not expected?

@lw (Contributor, Author) commented May 18, 2021

shall we add some test to verify that the new code throws correct error when the signature is not expected?

It's a static assert, hence an incorrect signature will cause a compile error. I don't know how we can test that?
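
For illustration, a minimal sketch of such a compile-time check — the names, the exact expected callback signature, and the message are assumptions, not the code from this PR:

```cpp
#include <type_traits>
#include <utility>

struct Future {};  // stand-in type for this sketch

template <typename T>
void addCallback(T&& callback) {
  // An incorrect signature is rejected at compile time: passing e.g. a
  // [](int) {} lambda fails to build with the message below, rather than
  // throwing at runtime where a unit test could catch it.
  static_assert(
      std::is_invocable_v<T, Future&>,
      "addCallback expects a callable taking a Future& (sketch only)");
  (void)std::forward<T>(callback);  // storing/invoking omitted in the sketch
}
```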

lw added 3 commits May 18, 2021 02:35
dgl-intel pushed a commit to dgl-intel/pytorch that referenced this pull request May 21, 2021
Pull Request resolved: pytorch#57638
ghstack-source-id: 129567067
Differential Revision: [D28222808](https://our.internmc.facebook.com/intern/diff/D28222808/)
@facebook-github-bot (Contributor)

This pull request has been merged in 1d7cf4b.

@facebook-github-bot deleted the gh/lw/158/head branch May 25, 2021 14:16