Don't use RTLD_GLOBAL to load _C. #31162

Closed · wants to merge 26 commits

Conversation

@ezyang (Contributor) commented Dec 12, 2019

Stack from ghstack:

This should help us resolve a multitude of weird segfaults and crashes
when PyTorch is imported along with other packages. Those would often
happen because libtorch symbols were exposed globally and could be used
as a source of relocations in shared libraries loaded after libtorch.

Based on apaszke's original work in #28536.

Fixes #3059.

Some of the subtleties in preparing this patch:

  • Getting ASAN to play ball was a pain in the ass. The basic problem is that when we load with RTLD_LOCAL, we may now load a library multiple times into the address space; this happens when we have custom C++ extensions. Since the libraries are usually identical, this is usually benign, but it is technically undefined behavior and UBSAN hates it. I tried a few things to get this to "work" correctly: I preload libstdc++ (so that it is seen consistently over all library loads) and turned off vptr checks entirely. Another possibility is to have a mode where we use RTLD_GLOBAL to load _C, which would be acceptable in environments where you're sure C++ lines up correctly. There's a long comment in the test script going into more detail about this.
  • Making some of our shared library dependencies load with RTLD_LOCAL breaks them. OpenMPI and MKL don't work; they play linker shenanigans to look up their symbols, which doesn't work when loaded locally, and if we load a library with RTLD_LOCAL we aren't able to subsequently see it with ctypes. To solve this problem, we employ a clever device invented by apaszke: we create a dummy library torch_global_deps with dependencies on all of the libraries which need to be loaded globally, and then load that with RTLD_GLOBAL. As long as none of these libraries have C++ symbols, we can avoid confusion about the C++ standard library. (A minimal sketch of this loading scheme follows below.)
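Below is a minimal sketch of the loading scheme described in the second bullet, assuming a stub library named libtorch_global_deps.so installed under torch/lib; the actual loader in torch/__init__.py may differ in naming and error handling.

```
import ctypes
import os
import platform


def _load_global_deps_sketch():
    # Illustrative only; the library name and path below are assumptions.
    if platform.system() == 'Windows':
        return  # Windows resolves symbols per-DLL, so there is nothing to do.
    lib_name = 'libtorch_global_deps.so'
    lib_path = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                            'lib', lib_name)
    # Load the stub with RTLD_GLOBAL so that its dependencies (MKL, OpenMPI,
    # ...) become visible to every library loaded afterwards, while _C and
    # libtorch themselves are imported later without RTLD_GLOBAL, keeping
    # libtorch's C++ symbols out of the global namespace.
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
```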

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D19262579

@kostmo (Member) commented Dec 12, 2019

💊 CircleCI build failures summary and remediations

As of commit 2771d72:

None of the build failures appear to be your fault.

  • 1/1 broken upstream at merge base 3c7db5c since Jan 07

    You may want to rebase on the viable/strict branch:

    If your commit is newer than viable/strict, you can try basing on an older, stable commit:

    git fetch origin viable/strict
    git rebase --onto viable/strict $(git merge-base origin/master HEAD)
    

    If your commit is older than viable/strict:

    git fetch origin viable/strict
    git rebase viable/strict
    

    Check out the recency history of this "viable master" tracking branch.

Detailed failure analysis

One may explore the probable reasons each build failed interactively on the Dr. CI website.

1 failure not recognized by patterns:

Job: CircleCI binary_macos_libtorch_2_7_cpu_build | Step: Build | Status: 🛑 Broken upstream

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions on the GitHub issue tracker.

This comment has been revised 77 times.

ezyang added a commit that referenced this pull request Dec 12, 2019
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

ghstack-source-id: 8f6e17418ca8503cb17b40ee5f5212221bacf2ff
Pull Request resolved: #31162
ezyang added a commit that referenced this pull request Dec 13, 2019
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

ghstack-source-id: 764d6138f7e3d113677e65cc06b6c103e3ec1c55
Pull Request resolved: #31162
ezyang added a commit that referenced this pull request Dec 13, 2019
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

ghstack-source-id: 948c142f04846261b8840060d9522112a45e8156
Pull Request resolved: #31162
# especially applies to type info, which is almost always weak. This
# has implications for RTTI (which UBSAN is rightly flagging won't
# work), but in our codebase, we don't use RTTI (because it doesn't
# work in mobile). However, UBSAN relies on UBSAN to detect vptr

Collaborator commented:

nit: UBSAN relies on UBSAN ?

ezyang added a commit that referenced this pull request Jan 7, 2020
Pull Request resolved: #31162
ghstack-source-id: 96370605
torch/__init__.py (outdated)

# See Note [Global dependencies]
def _load_global_deps():
    if platform.system() == 'Windows':

Collaborator commented:

What about defining a global variable IS_WINDOWS since it's used multiple times in this file?

Contributor Author (ezyang) replied:

I'll do this in a follow-up.
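For reference, the suggested refactor might look like the sketch below; IS_WINDOWS is the reviewer's proposed name and the function body is elided, so treat this as illustrative rather than the actual follow-up change.

```
import platform

# Module-level constant, as suggested in the comment above.
IS_WINDOWS = platform.system() == 'Windows'


def _load_global_deps():
    # See Note [Global dependencies]
    if IS_WINDOWS:
        return
    # ... load libtorch_global_deps with RTLD_GLOBAL, as sketched earlier ...
```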

ezyang added a commit that referenced this pull request Jan 8, 2020
Pull Request resolved: #31162
ghstack-source-id: 0f4abce4f757daadff64a567465c0cd9bdce83a3
ezyang added a commit that referenced this pull request Jan 8, 2020
Pull Request resolved: #31162
ghstack-source-id: c3991186af0f3ceca879e6d57b8c13431694b792
ezyang added a commit that referenced this pull request Jan 8, 2020
Pull Request resolved: #31162
ghstack-source-id: d112b7f68f612d8ada03ab99a7669333934987d1

@facebook-github-bot (Contributor) commented:
@ezyang merged this pull request in ddff4ef.

facebook-github-bot deleted the gh/ezyang/580/head branch January 13, 2020 15:39
facebook-github-bot pushed a commit that referenced this pull request Jan 22, 2020
Summary:
Fixes #31181 and #31162 (comment).
Pull Request resolved: #32215

Differential Revision: D19501869

Pulled By: ezyang

fbshipit-source-id: 363824e52d2592ad968ecf1df345aa4c0daff915
wuhuikx pushed a commit to wuhuikx/pytorch that referenced this pull request Jan 30, 2020
Summary:
Pull Request resolved: pytorch#31162

Differential Revision: D19262579

Test Plan: Imported from OSS

Pulled By: ezyang

fbshipit-source-id: 06a48a5d2c9036aacd535f7e8a4de0e8fe1639f2
wuhuikx pushed a commit to wuhuikx/pytorch that referenced this pull request Jan 30, 2020
Summary:
Fixes pytorch#31181 and pytorch#31162 (comment).
Pull Request resolved: pytorch#32215

Differential Revision: D19501869

Pulled By: ezyang

fbshipit-source-id: 363824e52d2592ad968ecf1df345aa4c0daff915
lly-zero-one pushed a commit to lly-zero-one/pytorch that referenced this pull request Feb 8, 2020
Summary:
This is another implementation of the maximum bailout depth.
The first version was implemented in https://github.com/pytorch/pytorch/pull/31521
This one has the following advantages:
* the bailout depth only exists in `CodeImpl` which seems to be an appropriate place to keep it in.
* threading many objects is reduced to threading through CodeImpl and getPlanFor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32073

Differential Revision: D19443432

Pulled By: Krovatkin

fbshipit-source-id: 898384bb2308a1532a50a33d9e05cfca504711e6

use gtest asserts in ProcessGroupGlooTest instead of other checks (#32138)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32138

I personally prefer `throw std::runtime_error("BOOM")`, but we should
probably have asserts here now that it is gtest. Also ensures that the correct
exceptions are thrown by the `testSignal` tests.
ghstack-source-id: 96811000

Differential Revision: D19382905

fbshipit-source-id: 1b00dd70524d03c8bd6f48715baa5070a7985467

Don't dispatch to integral types in smooth_l1_kernel

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32333

Differential Revision: D19442787

Pulled By: ngimel

fbshipit-source-id: 9578483202614d7406eceb13cbf15b253c04f237

Added cummin

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32238

Differential Revision: D19416791

Pulled By: anjali411

fbshipit-source-id: 5aadc0a7a55af40d76f444ab7d7d47ec822f55a5

Use default scale/zero_point in fake_quantize module instead of None (#32318)

Summary:
Distributed data parallel cannot broadcast None, so when we prepare the model for QAT and try to save the model, it will error out.
fixes: https://github.com/pytorch/pytorch/issues/32082
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32318

Differential Revision: D19434801

Pulled By: jerryzh168

fbshipit-source-id: ee70abe4c3dcdd3506fb7dd0316aee2fb1705469

Delete unused bernoulli_Tensor from THTensorRandom.h

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32328

Test Plan: Imported from OSS

Differential Revision: D19448736

Pulled By: pbelevich

fbshipit-source-id: 92380ca1e0c0ac88d100e6fba8d216a46d0b181e

Add a new job to support custom build (#32323)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32323

Since we released the custom build in 1.4.0, it's time to set up CI for it. This PR adds a new iOS job to the iOS builds. To save time, it only runs the arm64 build.

- Don't break any iOS jobs
- Custom Build works.

Test Plan: Imported from OSS

Differential Revision: D19451342

Pulled By: xta0

fbshipit-source-id: 9de305c004fc795710ecf01d436ef4792c07760c

Add 64bit atomic fetch add (#32354)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32354

adding an int64 version of AtomicFetchAdd

Reviewed By: bwasti

Differential Revision: D19434349

fbshipit-source-id: b2358e8c5c6b7cd7e7b21de974b4ee1b5258fcf4

Fix ASAN / potential segfault in quantized Tensor memory allocations.

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29882

Differential Revision: D18522039

Pulled By: AshkanAliabadi

fbshipit-source-id: 1fdc68491aa2ac176633b9ecc3ee78c9175a97aa

C++ C2/Glow operator unittest

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32258

Test Plan:
```
 buck test glow/fb/test/numerics:fp16_op_test
```

Reviewed By: bddppq

Differential Revision: D19401786

fbshipit-source-id: 1382b5208be6172d3e6f768dedad7ebec31cffc9

fix unchecked cast alias analysis (#32309)

Summary:
Unchecked cast just refines the type of a value; the value stays the same, so the output should alias the input.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32309

Differential Revision: D19439037

Pulled By: eellison

fbshipit-source-id: fe6902d0d9a5a9ef5e9c13e1dbd056576d8c327e

exposing CPU/GPU Copy ops (#32248)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32248

expose CPU/GPU copy ops

Test Plan: buck test mode/dev-nosan caffe2/caffe2/python/operator_test:torch_integration_test

Reviewed By: houseroad

Differential Revision: D19405856

fbshipit-source-id: 1df4aa202e26647cb81e9fe7e4478e594a5f7f3e

Updating submodules

Summary:
GitHub commits:

https://github.com/facebook/fb303/commit/29aba0a28715b89ef60c338ffa1db574e60fdf35
https://github.com/facebook/fbthrift/commit/37a97eb4de2596310339fcc1520c7e5dada37ab5
https://github.com/facebook/fbzmq/commit/0efdd5729236427074842bb91c9b4687e6721a69
https://github.com/facebook/folly/commit/6d886fc7ebe4a7cb55c7733f5d0ec2d85e7062bb
https://github.com/facebook/proxygen/commit/2e5854752afb8068fc0fbc6b736790260167d56d
https://github.com/facebook/wangle/commit/931d1c643bf4fa57fcdb3ca695ae643b39066476
https://github.com/facebookincubator/fizz/commit/781986ef716d85c66584612d2d1e261772f85699
https://github.com/facebookincubator/katran/commit/2e6d2903d7cfec77b7d2f878f2add87e354352f1
https://github.com/facebookincubator/mvfst/commit/e04348ff63f56ff791336ecfd037193f1bd9f822
https://github.com/pytorch/fbgemm/commit/e8650fd5601e28783f64f5a38541e6d562125375

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: abd7ee4aaec8401b2c885335940773a0655b4496

skip testExceptions in ProcessGroupGloo if built with TSAN (#32242)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32242

TSAN and fork don't play well together, so skip this test if we're
building under TSAN. It will still run in other modes.

Differential Revision: D19416113

fbshipit-source-id: 7e88d63a843356372160c2524c05e8fd1706553e

Renaming IValue List functions (#32093)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32093

toGenericListRef -> toListRef
isGenericList -> isList
toGenericList -> toList
toXListRef -> toXVector

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D19369767

Pulled By: zdevito

fbshipit-source-id: 4f0078f95b83e6586524c03f7bcf206722fdd9ae

Updating submodules

Summary:
GitHub commits:

https://github.com/facebookincubator/fizz/commit/54b290f00ff8a1e1bc12957f97d41b7f32b36268
https://github.com/facebookincubator/mvfst/commit/e8df50310d5d883660b409d2e484b6e05235ce3d
https://github.com/pytorch/fbgemm/commit/ef5c9efe120d1e8b5b263ebe37be8cb0c9583cc2

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: 7b6dc88d40e8fd8c396d4d12846db43b0fb4258c

Fix typos, via a Levenshtein-type corrector (#31523)

Summary:
Should be non-semantic.

Uses https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines to find likely typos, with https://github.com/bwignall/typochecker to help automate the checking.

Uses an updated version of the tool used in https://github.com/pytorch/pytorch/pull/30606 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31523

Differential Revision: D19216749

Pulled By: mrshenli

fbshipit-source-id: 7fd489cb9a77cd7e4950c1046f925d57524960ea

TensorIterator unrolling and vectorized load - step 0, 1 (#31974)

Summary:
These are steps 0 and 1 of https://github.com/pytorch/pytorch/issues/31975:

- Old code is moved to namespace `legacy`
- New `elementwise_kernel` and `launch_kernel` added to namespace `modern`, they only support 1d contiguous case for now
- In `gpu_kernel_impl`, dispatch to the new code if the problem is trivial 1d contiguous.

In terms of performance, this PR affects elementwise operators on contiguous tensors. Performance is improved slightly (up to 8%) for medium-size tensors on Volta.

See https://github.com/zasdfgbnm/things/blob/master/2020Q1/disassembly-elementwise.ipynb

We can see that, previously, the add kernel compiles to
```
	//## File "/home/xgao/pytorch-master/aten/src/ATen/native/cuda/Loops.cuh", line 71
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;
        /*0010*/              @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ;
        /*0020*/                   S2R R0, SR_TID.X ;
	//## File "/home/xgao/pytorch-master/aten/src/ATen/native/cuda/Loops.cuh", line 73
        /*0030*/                   S2R R3, SR_CTAID.X ;
        /*0040*/                   IMAD R0, R3, 0x200, R0 ;
	//## File "/home/xgao/pytorch-master/aten/src/ATen/native/cuda/Loops.cuh", line 76
        /*0050*/                   ISETP.GE.AND P0, PT, R0, c[0x0][0x160], PT ;
        /*0060*/               P0 EXIT ;
	//## File "/home/xgao/pytorch-master/aten/src/ATen/native/cuda/Loops.cuh", line 110
        /*0070*/                   IMAD R3, R0.reuse, c[0x0][0x194], RZ ;
        /*0080*/                   IMAD R6, R0, c[0x0][0x198], RZ ;
        /*0090*/                   IADD3 R4, P0, R3.reuse, c[0x0][0x178], RZ ;
        /*00a0*/                   IADD3 R2, P1, R6.reuse, c[0x0][0x180], RZ ;
        /*00b0*/                   LEA.HI.X.SX32 R5, R3, c[0x0][0x17c], 0x1, P0 ;
        /*00c0*/                   LEA.HI.X.SX32 R3, R6, c[0x0][0x184], 0x1, P1 ;
        /*00d0*/                   LDG.E.SYS R5, [R4] ;
        /*00e0*/                   LDG.E.SYS R2, [R2] ;
	//## File "/home/xgao/pytorch-master/aten/src/ATen/native/cuda/Loops.cuh", line 77
        /*00f0*/                   IMAD R0, R0, c[0x0][0x190], RZ ;
        /*0100*/                   IADD3 R6, P0, R0, c[0x0][0x170], RZ ;
        /*0110*/                   LEA.HI.X.SX32 R7, R0, c[0x0][0x174], 0x1, P0 ;
	//## File "/home/xgao/pytorch-master/aten/src/ATen/native/cuda/Loops.cuh", line 110
        /*0120*/                   FFMA R9, R2, c[0x0][0x1a0], R5 ;
	//## File "/home/xgao/pytorch-master/aten/src/ATen/native/cuda/Loops.cuh", line 170
        /*0130*/                   STG.E.SYS [R6], R9 ;
	//## File "/home/xgao/pytorch-master/aten/src/ATen/native/cuda/Loops.cuh", line 81
        /*0140*/                   EXIT ;
.L_16826:
        /*0150*/                   BRA `(.L_16826);
        /*0160*/                   NOP;
        /*0170*/                   NOP;
.L_29063:
```
Now it compiles to
```
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 210
        /*0000*/                   MOV R1, c[0x0][0x28] ;
        /*0010*/              @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ;
        /*0020*/                   S2R R6, SR_CTAID.X ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 217
        /*0030*/                   MOV R7, 0x4 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 208
        /*0040*/                   S2R R3, SR_TID.X ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 210
        /*0050*/                   LEA R6, R6, R3, 0x8 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 225
        /*0060*/                   IADD3 R2, R6.reuse, 0x40, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 217
        /*0070*/                   IMAD.WIDE R4, R6.reuse, R7.reuse, c[0x0][0x190] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 225
        /*0080*/                   IADD3 R3, R6, 0x80, RZ ;
        /*0090*/                   ISETP.GE.AND P1, PT, R2, c[0x0][0x160], PT ;
        /*00a0*/                   ISETP.GE.AND P0, PT, R6.reuse, c[0x0][0x160], PT ;
        /*00b0*/                   ISETP.GE.AND P2, PT, R3, c[0x0][0x160], PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 217
        /*00c0*/                   IMAD.WIDE R2, R6.reuse, R7, c[0x0][0x188] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 225
        /*00d0*/                   IADD3 R14, R6, 0xc0, RZ ;
        /*00e0*/                   ISETP.GE.AND P3, PT, R14, c[0x0][0x160], PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 228
        /*00f0*/              @!P1 LDG.E.SYS R11, [R4+0x100] ;
        /*0100*/              @!P0 LDG.E.SYS R0, [R2] ;
        /*0110*/              @!P0 LDG.E.SYS R9, [R4] ;
        /*0120*/              @!P1 LDG.E.SYS R8, [R2+0x100] ;
        /*0130*/              @!P2 LDG.E.SYS R10, [R2+0x200] ;
        /*0140*/              @!P2 LDG.E.SYS R13, [R4+0x200] ;
        /*0150*/              @!P3 LDG.E.SYS R12, [R2+0x300] ;
        /*0160*/              @!P3 LDG.E.SYS R15, [R4+0x300] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 245
        /*0170*/                   IMAD.WIDE R6, R6, R7, c[0x0][0x180] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 191
        /*0180*/                   FFMA R9, R9, c[0x0][0x168], R0 ;
        /*0190*/                   FFMA R11, R11, c[0x0][0x168], R8 ;
        /*01a0*/                   FFMA R13, R13, c[0x0][0x168], R10 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 245
        /*01b0*/              @!P0 STG.E.SYS [R6], R9 ;
        /*01c0*/              @!P1 STG.E.SYS [R6+0x100], R11 ;
        /*01d0*/              @!P2 STG.E.SYS [R6+0x200], R13 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 191
        /*01e0*/                   FFMA R15, R15, c[0x0][0x168], R12 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 244
        /*01f0*/               P3 EXIT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 245
        /*0200*/                   STG.E.SYS [R6+0x300], R15 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 248
        /*0210*/                   EXIT ;
.L_727:
        /*0220*/                   BRA `(.L_727);
        /*0230*/                   NOP;
        /*0240*/                   NOP;
        /*0250*/                   NOP;
        /*0260*/                   NOP;
        /*0270*/                   NOP;
.L_32233:
```

The benchmark is for add kernel on Volta.

See https://github.com/zasdfgbnm/things/blob/master/2020Q1/benchmark-unroll.ipynb

For tensors of size from 2^20 to 2^30, previously we had
```
1.5.0a0+dedd16b
dedd16b4181cae81e37e978cd3bf24c1ba35ca05
33 µs ± 31.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
48.7 µs ± 75 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
78.9 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
140 µs ± 51.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
261 µs ± 71.4 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
506 µs ± 159 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
993 µs ± 189 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.96 ms ± 139 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.9 ms ± 955 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.79 ms ± 187 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
Now we have
```
1.5.0a0+b1a239b
b1a239be8d529e89875fe47cd09964ef3a9516ac
30.4 µs ± 18 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
45.2 µs ± 46.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
75 µs ± 476 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
134 µs ± 192 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
253 µs ± 354 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
489 µs ± 138 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
961 µs ± 431 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.91 ms ± 578 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.8 ms ± 88.8 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.57 ms ± 763 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
It is slightly better.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31974

Differential Revision: D19450765

Pulled By: ngimel

fbshipit-source-id: 79601bfceb5da84ff87384ba8193793eb4095a2e

run code analysis against mobile interpreter (#32276)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32276

Include mobile interpreter in mobile code analysis pass, which has some
manually registered ops in temporary namespaces.

The mobile interpreter is still under development and these ops will be
removed in the future. This is a temporary step for internal build
experiment.

Test Plan: Imported from OSS

Differential Revision: D19426818

Pulled By: ljk53

fbshipit-source-id: 507453dc801e5f93208f1baea12400beccda9ca5

Specify requires_grad for Parameter replica so it's not always set to True by default (#32356)

Summary:
This is the proposed fix for issue https://github.com/pytorch/pytorch/issues/32018
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32356

Differential Revision: D19450648

Pulled By: mrshenli

fbshipit-source-id: c63eeb6e9f5a87ebe613dd7013907559f295a7ea
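A rough sketch of the idea behind the fix above; this is not the actual replication code, just an illustration that a replica should inherit the original parameter's requires_grad instead of the default True.

```
import torch

param = torch.nn.Parameter(torch.randn(3), requires_grad=False)

# A replica that blindly wraps the data gets requires_grad=True by default:
bad_replica = torch.nn.Parameter(param.detach().clone())
assert bad_replica.requires_grad  # not what we want

# Passing requires_grad through preserves the original setting:
good_replica = torch.nn.Parameter(param.detach().clone(),
                                  requires_grad=param.requires_grad)
assert not good_replica.requires_grad
```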

Fix cudnn channels_last descriptors problem (#31952)

Summary:
This is to append fixes to https://github.com/pytorch/pytorch/issues/31783 so we can pull the fixes in without breaking tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31952

Differential Revision: D19433839

Pulled By: ngimel

fbshipit-source-id: 5b3d2f0b2a86aacd1d100dd86996ee0d63e5ee92

Updating submodules

Summary:
GitHub commits:

https://github.com/facebook/fbthrift/commit/9b13f58aa1b1a5a65f21cf9a80f8552f5c07ff60
https://github.com/facebook/folly/commit/044b292accb454838008f0fe88eea0c78c9af27e
https://github.com/pytorch/fbgemm/commit/e1f67bbf3da31ca8fc5f4f506d4791cd8883b448

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: 21df26f60f436eb8c1766f66afac4a0d93dd33d1

Back out "Calling JITed 8 Bit Fused SLS in FBGEMM from C2" (#32381)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32381
Original commit changeset: 0dfa936eb503

"Facebook"
Temporary remedy for SEV :
https://our.intern.facebook.com/intern/sevmanager/view/s/193726

Test Plan: Run CI tests

Reviewed By: jspark1105

Differential Revision: D19458382

fbshipit-source-id: 731790f96b341ade5e70ff13e4b0b5fafad0fea6

Remove stray `@script` (#32235)

Summary:
This should be covered under recursive script now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32235

Pulled By: driazati

Differential Revision: D19414889

fbshipit-source-id: 85f8132401dbe44c9dbaef7c0350110f90eb9843

porting scatter_add to ATen (CPU) (#31662)

Summary:
Fixes [https://github.com/pytorch/pytorch/issues/24758](https://github.com/pytorch/pytorch/issues/24758).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31662

Differential Revision: D19440824

Pulled By: ngimel

fbshipit-source-id: b13443cfcc8bcb9ec21f1cddb5c6fbc0ef4bb0f2

Temporary workaround for BC test due to schema parser changes

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32324

Test Plan: Imported from OSS

Differential Revision: D19438085

Pulled By: jamesr66a

fbshipit-source-id: 3dd2586e73c890a7bdadd6cbb3df2c186f93199d

Remove __torch__ from custom class qualname

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32301

Test Plan: Imported from OSS

Differential Revision: D19431645

Pulled By: jamesr66a

fbshipit-source-id: 198522a1641cb9f90fa4c614da4ca4162fadf456

Fix returning instance of custom class from method

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32312

Test Plan: Imported from OSS

Differential Revision: D19433511

Pulled By: jamesr66a

fbshipit-source-id: f048d5f60eaba992ee42fea2d318a59b3a156578

Test passing custom class instance to bound method

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32320

Test Plan: Imported from OSS

Differential Revision: D19437335

Pulled By: jamesr66a

fbshipit-source-id: 8f5166dbe6fc5704b12b6224932460b12be0d39b

support torch script call over rpc (#32197)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32197

This is to reland https://github.com/pytorch/pytorch/pull/30063; the main change is to match a general exception and grep for the "pickle" error word in the "test_script_functions_not_supported" unit test, as Python 3.5 and Python 3.6 throw different types of errors with different error messages for the rpc call in the unit test.
[test all] This diff makes the following changes:
1. Provides a new set of private Python RPC APIs; they can accept an annotated TorchScript call, and this call can be serialized, deserialized and executed in C++ without the GIL. These private APIs will be bound to JIT in the future, and they differ from the public APIs in that the future JIT-bound private APIs will accept a qualified_name, not callables. These private APIs are subject to deprecation once JIT supports a torch script function being a JIT type.

Also, these APIs require the torch script function to be defined and annotated by users in Python land; it cannot be a script class/module constructor or a class/module method.

2. This diff also allows the public RPC APIs to accept an annotated TorchScript call and execute the same code path that the above private APIs run on. Therefore, if users invoke an annotated TorchScript call over RPC, this call can be serialized, deserialized and executed in C++ without the GIL as well.

3. The above private APIs call a newly defined C++ function so that the RPC torch script call can be serialized, deserialized and executed in C++ land. This C++ function returns an ivalue::Future, so that in a follow-up diff it can be called when these private APIs are bound to JIT.

4. The script_call.cpp/.h and request_callback_impl.cpp files are refactored accordingly so that torch script calls and builtin calls can share the same message type and code.

5. Refactored deserializeResponse() and added a new utility to deserialize a response to an IValue.

ghstack-source-id: 96879167

Test Plan: unit test

Differential Revision: D19402374

fbshipit-source-id: 04efcc7c167d08a6503f29efe55e76f2be4b2c5e
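As a hedged illustration of what point 2 above enables (worker names and setup are placeholders, and the function name is my own), a scripted function can be passed to the public RPC API and executed on the callee without holding the GIL:

```
import torch
import torch.distributed.rpc as rpc


@torch.jit.script
def scripted_add(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return a + b

# After rpc.init_rpc(...) has set up the workers (omitted here), the scripted
# function can be invoked remotely just like a regular Python callable:
# result = rpc.rpc_sync("worker1", scripted_add,
#                       args=(torch.ones(2), torch.ones(2)))
```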

Updating submodules

Summary:
GitHub commits:

https://github.com/facebook/fb303/commit/ea6039a6c98f089b7d5b4455715effbf492deb80
https://github.com/facebook/fbthrift/commit/0d30b8e0fc3191b18d16e1ebb1d7db74dc39b082
https://github.com/facebook/fbzmq/commit/7acedd4723f1997d51638f583bee061abff3b58b
https://github.com/facebook/folly/commit/4db6e3b78569d72dd2c11a13ba508daa02c97fac
https://github.com/facebook/proxygen/commit/cd898afb5e249266789f76951ca1e8ded5a09d5f
https://github.com/facebook/wangle/commit/cf5dd1120450ffe81be83f51396231907cfec325
https://github.com/facebookincubator/fizz/commit/08bdcfd87ed0b382956c6c1ee3ba01e2b48dab1d
https://github.com/facebookincubator/katran/commit/fc84c09b8f104bb3b1497ff97132d39789b37ed1
https://github.com/facebookincubator/mvfst/commit/454d37976b88605aa3ff7cfc7f8f735d385e0bea
https://github.com/pytorch/fbgemm/commit/a22e6b8cb480dadfdada25188c50d65acd39f649

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: b87550b26e69216be2a8e40870a6e7dab825261c

support empty batch in group normalization (#32401)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32401

https://github.com/pytorch/pytorch/issues/12013

Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- 'test_GroupNorm_empty'

Differential Revision: D19463720

fbshipit-source-id: 8ae44590fc5eeb1adc69a2345d7cc2187d3307ac
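A small illustration of what "support empty batch" means for the commit above (the shapes are my own example, not taken from the test plan):

```
import torch

gn = torch.nn.GroupNorm(num_groups=2, num_channels=4)
out = gn(torch.empty(0, 4, 8, 8))   # zero-sized batch
assert out.shape == (0, 4, 8, 8)    # passes once empty batches are supported
```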

Removed unused weight update in prepack. Moved zero point update to qlinear/qconv to be consistent with data update. (#32254)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32254

Differential Revision: D19422929

Pulled By: kimishpatel

fbshipit-source-id: 595a4f7d6fde4978c94f3e720ec8645f3f2bdb7a

Build: Respect USE_CUDNN=0, even if cudnn is found (#32404)

Summary:
Currently, setting `USE_CUDNN=0` has no effect and any cudnn library found on your system will be used anyway. This is especially problematic when your system has multiple CUDA versions installed, and you are building with a version that lacks a matching cudnn. CMake will find any other cudnn versions and you end up with both CUDA versions added to your compiler include paths.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32404

Differential Revision: D19499425

Pulled By: ezyang

fbshipit-source-id: a9b3f6f9dc22033481c3c1c5999b1a7ef98468cb

Make type of `Tensor.type()` more specific (#32353)

Summary:
Fixes the following issue:

```
$ cat test.py
import torch

t = torch.tensor(1.5)
t.type(torch.float32)[None]

$ mypy test.py
test.py:4: error: Invalid index type "None" for "Union[str, Tensor]"; expected type "Union[int, slice]"
Found 1 error in 1 file (checked 1 source file)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32353

Differential Revision: D19499388

Pulled By: ezyang

fbshipit-source-id: 715111e934aea020b20f850d27e32c4f70b82572

.circleci: Only run macos libtorch on master (#32378)

Summary:
These jobs were taking forever to run, so we decided it's only really worth running them on master.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32378

Differential Revision: D19499301

Pulled By: seemethere

fbshipit-source-id: 22cac5b5baee84e44607a16daeb77048cb0f5974

F.normalize uses clamp_min_ inplace (#32360)

Summary:
We don't care about autograd when `out!=None` anyways
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32360

Differential Revision: D19452402

Pulled By: colesbury

fbshipit-source-id: c54775289f8a700019ca61e951d59ff4894ac980
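A sketch of the idea in the commit above (not the actual F.normalize source): when an explicit `out` tensor is given, autograd isn't needed, so the epsilon clamp on the denominator can be done in place.

```
import torch

def _denominator(x, p=2.0, dim=1, eps=1e-12, has_out=False):
    norm = x.norm(p, dim, keepdim=True)
    # In-place clamp is fine when we won't backprop through the result;
    # otherwise stay out-of-place to keep the autograd graph intact.
    return norm.clamp_min_(eps) if has_out else norm.clamp_min(eps)
```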

Synchronize with ShipIt.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

add an option to record time spent waiting for GIL (#30842)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30842

We'd like to profile the time spent on GIL acquisition to debug
performance issues.

Test Plan: Unit tests pass.

Differential Revision: D18837590

fbshipit-source-id: 925968f71c5fb96b8cd93f1eab4647602d2617d1

Fix cusparse version check (#32405)

Summary:
The current version check doesn't use proper lexicographic comparison and so will break for future versions of cuSPARSE with `CUSPARSE_VER_MAJOR > 10` and `CUSPARSE_VER_MINOR < 2`. Also, my cusparse headers for CUDA 9 don't seem to include version macros at all, so I added `if !defined` to be explicit about that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32405

Differential Revision: D19499412

Pulled By: ezyang

fbshipit-source-id: 1593bf1e5a4aae8b75bb3b350d016cc6c3b9c009
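To illustrate the lexicographic-comparison point from the commit above in language-agnostic terms (the real check lives in C preprocessor macros), comparing (major, minor) as a tuple gives the intended ordering:

```
def cusparse_at_least(major, minor, required=(10, 2)):
    # Compare (major, minor) lexicographically; a naive "minor >= 2" check
    # would wrongly reject e.g. version 11.0.
    return (major, minor) >= required

assert cusparse_at_least(10, 2)
assert cusparse_at_least(11, 0)
assert not cusparse_at_least(10, 1)
```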

Remove dead includes in caffe2/test

Reviewed By: ezyang

Differential Revision: D19273220

fbshipit-source-id: 3dfc3388914e60611c84472e3fc529f5b5e40534

Set rpath for JNI library on Mac (#32247)

Summary:
Without this, dlopen won't look in the proper directory for dependencies
(like libtorch and fbjni).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32247

Test Plan:
Built libpytorch_jni.dylib on Mac, replaced the one from the libtorch
nightly, and was able to run the Java demo.

Differential Revision: D19501498

Pulled By: dreiss

fbshipit-source-id: 13ffdff9622aa610f905d039f951ee9a3fdc6b23

Fix BC test after TorchBind changes (#32429)

Summary:
It was broken by https://github.com/pytorch/pytorch/issues/32320. Let's be on the safe side and just whitelist all testing ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32429

Differential Revision: D19501016

Pulled By: dzhulgakov

fbshipit-source-id: 9cc1d363edb4579905bee1976a2b57255ce41738

Redundant condition (#32396)

Summary:
Optimize expression: 'A || (!A && B)' <=> 'A || B'

A: relErr <= maxRelErr
!A : relErr > maxRelErr
B: absErr <= absErrForRelErrFailure
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32396

Differential Revision: D19499370

Pulled By: ezyang

fbshipit-source-id: c19bdcb2d4e7ff7806a8cd181c6e7e9e276b9979
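The simplification in the commit above is the boolean identity A || (!A && B) == A || B; rendered in Python with the names from the summary (the original code is C++ test logic):

```
def passes_old(rel_err, max_rel_err, abs_err, abs_err_for_rel_err_failure):
    return (rel_err <= max_rel_err
            or (rel_err > max_rel_err and abs_err <= abs_err_for_rel_err_failure))

def passes_new(rel_err, max_rel_err, abs_err, abs_err_for_rel_err_failure):
    # Equivalent: the "!A" term is redundant inside an "or" with A.
    return rel_err <= max_rel_err or abs_err <= abs_err_for_rel_err_failure
```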

Enhance NCCL watchdog to actively abort communicators for timed-out ops. (#32338)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32338

Timed-out ops could linger around if the user doesn't actually call
`wait()` on that op. As a result, to fix this I've introduced the following
functionality in this PR:

1. Keep track of all outstanding work in ProcessGroupNCCL.
2. Enhance NCCL watchdog to sweep through all outstanding work and perform the
following operations:
  i.   If the work has timed out, abort all communicators for that work and
       remove them from the cache.
  ii.  If the communicators for the work receive an error, abort the
       communicators and remove them from the cache.
  iii. If the work has completed (successfully/unsuccessfully), remove it from
       the list of outstanding work.
ghstack-source-id: 96895704

Test Plan: waitforbuildbot

Differential Revision: D19401625

fbshipit-source-id: 8f6f277ba2750a1e1aa03cdbc76e8c11862e7ce5

Revert "Temporary workaround for BC test due to schema parser changes" (#32441)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32441

This reverts commit ceffdbd2179e7dafdc6407909a00f4267db040de.

Test Plan: Imported from OSS

Reviewed By: houseroad

Differential Revision: D19500043

Pulled By: jamesr66a

fbshipit-source-id: 3bd22c55e4a81ff8b89d27f6e7438e3bdfc18606

Updating submodules

Summary:
GitHub commits:

https://github.com/facebook/fbthrift/commit/47e0b9b97e19c34dc15a6abf0e8ed93063870ce8
https://github.com/facebook/folly/commit/6d225aaf95b58baf2420efec7f4c570a2d426395
https://github.com/pytorch/fbgemm/commit/ab4da8f60a0194f04c55aa4c9b74c5c175bd1172

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 27bcdf08b6f5e47a5c948e094aca26bf67a6fb66

QNNPACK: Add support for dynamic quantization.

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31896

Test Plan: Added new tests to QNNPACK's test suite to cover the new use case.  All new tests are passing.

Reviewed By: supriyar

Differential Revision: D19443250

Pulled By: AshkanAliabadi

fbshipit-source-id: fa7b1cffed7266a3c198eb591d709f222141a152

Updating submodules

Summary:
GitHub commits:

https://github.com/facebook/fbthrift/commit/40b08129cfd2aed6dba56d10d8cea4ac0ef6932e
https://github.com/facebook/proxygen/commit/8cd8d286e68a06968b80dd5a6d8e150392b87aea
https://github.com/facebook/rocksdb/commit/d305f13e2124132863267eb49b2a08ede679d2c4
https://github.com/pytorch/fbgemm/commit/2957bd45f19d8fa2d185e26b7ada5a394c5ba5b4

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 3b76eb7c8b6b5cf617aca7bd143e1ee404c4f0ed

Adagrad optimizer - updated step function, added param_groups, state to optimizers

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29335

Differential Revision: D19449382

Pulled By: anjali411

fbshipit-source-id: ee238801ed9cdf15a80f2ce31cc4aab8ba582aea

Enhance DispatchStub to be thread-safe from a TSAN point of view. (#32148)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32148

TSAN would complain about multiple threads reading and writing to the
`cpu_dispatch_ptr` without any sort of synchronization. Although this is a
valid issue from a TSAN point of view, there wasn't a correctness issue, since
both threads would compute the same value.

In order to fix this, I've used std::atomic for cpu_dispatch_ptr with relaxed
ordering guarantees.
ghstack-source-id: 96989435

Test Plan: Verify the TSAN tests pass.

Differential Revision: D19386082

fbshipit-source-id: 1ff0893e02529eddd06b2855d9565edf1bbf1196

Fix test_data_parallel name errors and add to run_test.py (#32428)

Summary:
While working on https://github.com/pytorch/pytorch/issues/31768 and trying to add tests for `DataParallel`, I discovered that:
- `test_data_parallel.py` can't be run through `run_test.py`
- running it with `pytest` fails with many name errors

`test_data_parallel.py` seems to have been split from `test_nn.py` in https://github.com/pytorch/pytorch/issues/28297 but not in a state where it can actually be run. Presumably `DataParallel` hasn't been tested by CI in the time since.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32428

Differential Revision: D19499345

Pulled By: ezyang

fbshipit-source-id: f9b748a99a5c85fc6675c22506cf10bbfd9c8a4d

Updating submodules

Summary:
GitHub commits:

https://github.com/facebook/fbthrift/commit/d45f7b4f0972951c2548e918c0bc167f397815b3
https://github.com/facebook/rocksdb/commit/e6e8b9e8718698b334d18fa8f5ab6db30b147c53
https://github.com/facebookincubator/katran/commit/da618022d26b0786d4a090f38006db9ae584f2cb
https://github.com/pytorch/fbgemm/commit/2df47f519a6c896b7c418a8a94aae9c07ba7285c

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: c4af09e70a56d11e845150ba3d90a570a3758e51

Move log_normal to Aten(CPU) (#31854)

Summary:
Fix https://github.com/pytorch/pytorch/issues/24723.
Benchmark script :
```
import torch
import torch.nn as nn
import time

torch.manual_seed(0)

def _time():
    return time.time()

device = "cpu"

for n in [10, 100, 1000]:
    input = torch.randn(128, n, requires_grad=False, device=device)
    for i in range(1000):
        input.log_normal_()

for n in [1, 10, 100, 1000]:
    fwd_t = 0
    input = torch.randn(128, n, requires_grad=False, device=device)
    for i in range(10000):
        t1 = _time()
        input.log_normal_()
        t2 = _time()
        fwd_t = fwd_t + (t2 -t1)
    fwd_avg = fwd_t / 10000 * 1000
    print("input size(128, %d) forward time is %.4f (ms)." % (n, fwd_avg))
```
Test Device: skx-8180.
Before:
```
input size(128, 1) forward time is 0.0114 (ms).
input size(128, 10) forward time is 0.1021 (ms).
input size(128, 100) forward time is 1.0081 (ms).
input size(128, 1000) forward time is 10.1831 (ms).
```
After:
```
input size(128, 1) forward time is 0.0108 (ms).
input size(128, 10) forward time is 0.0969 (ms).
input size(128, 100) forward time is 0.9804 (ms).
input size(128, 1000) forward time is 9.6131 (ms).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31854

Differential Revision: D19314586

Pulled By: pbelevich

fbshipit-source-id: 2ea1d9a2c505e36aca9e609b52ccb3e8caf2ba8f

Updating submodules

Summary:
GitHub commits:

https://github.com/facebook/proxygen/commit/d2ee8a1a3fc0bceee0dae34de37d1e23a8383977
https://github.com/pytorch/fbgemm/commit/a1543b168df44c4722fa545746aaaa7cf9660f6d

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: a1394f1c4a48920d3ce1403c70351e2c56eaecf0

`insert_quant_dequant` pass support shared class types (#31408)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31408

We'll error out when a graph is quantized with different QSchemes.
This only occurs when we have two modules of the same type (e.g. two Conv2d modules initialized with
the same arguments) that are quantized with two configs that would produce different quantized graphs, for example
per-tensor affine and per-channel affine. This is a rare case, so it should be OK to skip for now.
Actual support will come later.

Test Plan:
test_jit.py, test_quantization.py

Imported from OSS

Differential Revision: D19162366

fbshipit-source-id: 798f06d0ddef0c8458237ce88b62159cc77eec8b

Remove the support of build options like NO_*, WITH_* (#32447)

Summary:
We will now use USE_*, BUILD_* consistently. The backward compatibility
for NO_* and WITH_* is hereby removed in this commit, as promised in the
comment (next release is beyond Feb 20):

    # Before we run the setup_helpers, let's look for NO_* and WITH_* variables and hotpatch environment with the USE_*
    # equivalent The use of NO_* and WITH_* is deprecated and will be removed in Feb 20, 2020.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32447

Differential Revision: D19515536

Pulled By: ezyang

fbshipit-source-id: 2f2c51e6d4674af690b190a1f0397b8f596b6a15

Implement backend fallback fallthrough (#32439)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32439

This adds c10::fallthrough_kernel, which is a special boxed function that
can be used to implement fallthrough behavior at a dispatch key.  A fallthrough
kernel will redispatch to the next valid dispatch key.  It is implemented
in such a way that it costs no more to fall through than it does to go
straight to the actual implementation of the kernel.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D19503886

Test Plan: Imported from OSS

Pulled By: ezyang

fbshipit-source-id: 6ee05bd815c4ef444e612d19f62312dbb76f2787

fix torch.eq() doc entry (#32399)

Summary:
fix `torch.eq()` entry example to match the current output (boolean, instead of uint8)
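
For reference, a minimal snippet (my own) showing the current output the doc entry should match:

```python
import torch

# torch.eq now returns a bool tensor rather than uint8
print(torch.eq(torch.tensor([1, 2]), torch.tensor([1, 1])))
# tensor([ True, False])
```
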
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32399

Differential Revision: D19498104

Pulled By: ezyang

fbshipit-source-id: e7ec1263226766a5c549feed16d22f8f172aa1a3

Always return a new tensor from nn.functional.pad (#32350)

Summary:
Fixes https://github.com/pytorch/pytorch/issues/31734
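
A minimal sketch of the new behavior, assuming the all-zero padding case from the linked issue:

```python
import torch
import torch.nn.functional as F

x = torch.randn(3, 3)
y = F.pad(x, (0, 0))   # an all-zero pad used to return the input itself
print(y is x)          # expected to print False after this change
```
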
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32350

Differential Revision: D19501845

Pulled By: ezyang

fbshipit-source-id: ea79496d23dc0016f3caa233c53d283b08f60371

Put sparse all reduce results to input tensors (#32226)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32226

Right now, if users call torch.dist.all_reduce() on dense tensors, outputs are put in the input tensors, but if users call torch.dist.all_reduce() on sparse tensors, outputs are neither returned explicitly to users nor put in the input tensors.

To make the torch.dist.all_reduce() API have the same behavior on both dense tensors and sparse tensors, this diff makes torch.dist.all_reduce() on sparse tensors put the output in the input tensors as well. This is achieved by simply calling input_sparse.copy_(output_sparse); see PR https://github.com/pytorch/pytorch/pull/9005, which implemented copy_ for sparse tensors.
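
An illustrative sketch of the resulting behavior (assumes an already-initialized process group, e.g. the gloo backend, which supports sparse all_reduce):

```python
import torch
import torch.distributed as dist

# after this change the reduced result is written back into the sparse input,
# matching the dense behavior
t = torch.sparse_coo_tensor([[0, 1], [1, 0]], [1.0, 2.0], (2, 2))
dist.all_reduce(t)   # t now holds the summed sparse tensor across ranks
```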

close #31413
ghstack-source-id: 96984228

Test Plan: unit test

Differential Revision: D19192952

fbshipit-source-id: 2dd31dc057f20cc42b44b9e55df864afa2918c33

Fix dll load logic for Python 3.8 on Windows (#32215)

Summary:
Fixes https://github.com/pytorch/pytorch/issues/31181 and https://github.com/pytorch/pytorch/pull/31162#discussion_r362495611.
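
For context, a rough sketch of the Python 3.8+ Windows pattern this kind of fix relies on (the path below is hypothetical):

```python
import os
import sys

# since Python 3.8, Windows no longer searches PATH for the dependent DLLs of
# extension modules, so the directories have to be registered explicitly
if sys.platform == 'win32' and sys.version_info >= (3, 8):
    os.add_dll_directory(os.path.join(os.path.dirname(__file__), 'lib'))
```
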
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32215

Differential Revision: D19501869

Pulled By: ezyang

fbshipit-source-id: 363824e52d2592ad968ecf1df345aa4c0daff915

Migrate max and min (binary) from TH to ATen. (#30851)

Summary:
TH implementation will be removed after the unary max and min are
migrated.

Benchmark: (Debian 10, Release build, gcc 7.4, no turbo)

```python
import timeit

for device in ('cpu', 'cuda'):
    print(f'device: {device}')
    for op in ('max', 'min'):
        for dtype in ('torch.double', 'torch.float', 'torch.int16',
                      'torch.int32', 'torch.int64'):
            for n, t in [(10_000, 200000),
                         (100_000, 20000)]:
                print(f'torch.{op}(a, b), numel() == {n} for {t} times, dtype={dtype}')
                print(timeit.timeit(
                    f'torch.{op}(a, b)' + (';torch.cuda.synchronize()' if device == 'cuda' else ''),
                    setup=(f'import torch; '
                           f'a = torch.arange({n}, dtype={dtype}, device="{device}"); '
                           f'b = torch.ones({n}, dtype={dtype}, device="{device}") * ({n} // 2)'),
                    number=t))
    print()
```

Before:

```
device: cpu
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.double
2.241763713000182
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.double
1.7138833169992722
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.float
2.2183356810000987
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.float
1.7031846980007685
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int16
1.7704679510006827
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int16
1.289198366999699
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int32
1.7937613740014058
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int32
1.2930124340000475
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int64
1.8032857640009752
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int64
1.2908709189996443
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.double
1.8829010000008566
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.double
1.2994690759987861
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.float
1.8037853410005482
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.float
1.2929310759991495
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int16
1.8075240359994496
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int16
1.2932477679987642
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int32
1.7868400779989315
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int32
1.2885970789993735
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int64
1.8389664830010588
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int64
1.29402057399966

device: cuda
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.double
4.787109836999662
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.double
1.842438002999188
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.float
3.429616614999759
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.float
1.835390076999829
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int16
2.940423873000327
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int16
1.4108991760003846
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int32
2.9318018840003788
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int32
1.4168134739993548
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int64
2.9610764919998473
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int64
1.4189234130008117
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.double
2.960172712999338
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.double
1.4162539499993727
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.float
2.8985912560001452
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.float
1.4113489299998037
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int16
2.9160250799995993
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int16
1.4128787690005993
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int32
2.8806865219994506
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int32
1.4086357010000938
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int64
2.9362181240012433
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int64
1.4151225870009512

```

After:

```
device: cpu
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.double
2.2685823729998447
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.double
1.72004808300062
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.float
2.212242640000113
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.float
1.7089235590001408
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int16
1.7767087259999244
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int16
1.2916517639996528
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int32
1.8265984959998605
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int32
1.3002885240002797
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int64
1.8084679720004715
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int64
1.3012119999993956
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.double
1.8800218449996464
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.double
1.3060645710002063
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.float
2.4905043950002437
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.float
1.9126290209997023
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int16
1.7972335520007618
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int16
1.2918074379995232
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int32
1.8047651860006226
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int32
1.2992197730000044
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int64
1.8526509560006161
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int64
1.3030709570002728

device: cuda
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.double
4.700986622000528
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.double
1.8415469050005413
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.float
3.3051693249999516
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.float
1.8321999460004008
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int16
2.8086475109994353
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int16
1.405110773999695
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int32
2.913458047999484
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int32
1.4236377289998927
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int64
2.9386842409994642
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int64
1.4230227469997772
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.double
3.0341797270002644
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.double
1.4289592409995748
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.float
3.6091147850002017
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.float
2.036691903999781
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int16
2.8256167649997224
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int16
1.4078955400000268
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int32
2.8631781489993955
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int32
1.4210130069996012
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int64
3.0112479260005784
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int64
1.4297719679998409

```

Partly solves https://github.com/pytorch/pytorch/issues/24594 and #24595

Closes https://github.com/pytorch/pytorch/issues/25016

Continues https://github.com/pytorch/pytorch/issues/27185
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30851

Differential Revision: D19515694

Pulled By: ezyang

fbshipit-source-id: 1764897f912d6ae24b0c361f19a1aacf96e0826e

add missing align_corners annotation (#32492)

Summary:
Adds the missing `align_corners` annotation in the `grid_sample` and `affine_grid` functionals.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32492

Differential Revision: D19516550

Pulled By: ezyang

fbshipit-source-id: 064c8c99bf6eae6744237c0b151b3ce4c82ada96

Move some of the helper functions for public use (#32202)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32202

Move some helper functions in ModuleUseDeduper for public use

Test Plan:
.

Imported from OSS

Differential Revision: D19508034

fbshipit-source-id: 2e8e05eff6f3bbcfe6936598371e4afa72f9b11f

Fix comparisons for ConcreteModuleType (#32256)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32256

Previously two unrelated modules loaded from torch.jit.load
would compare equal because we only considered their data_ attributes which
are initialized blank in torch.jit.load. This changes ConcreteModuleType
to distinguish when the data_ attribute is blank vs when it is empty.

This replaces the poisoned logic.
ghstack-source-id: 96755797

Test Plan: oss

Differential Revision: D19423055

fbshipit-source-id: 79d6a50a3731c6eeb8466ba2a93702b49264bba0

Add str[] float[] constants resubmit

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31791

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D19439513

Pulled By: eellison

fbshipit-source-id: a04c7401687b051f0d4fb4794963931ebe004194

improve mayContainAlias (#31839)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31839

There are a number of improvements that can be made to `mayContainAlias`, which I would like to do in follow-ups. For now, this is an easy one.

Test Plan: Imported from OSS

Differential Revision: D19439516

Pulled By: eellison

fbshipit-source-id: 0042fb7eaae6cfb4916bf95dc38280517a4bd987

remove tuple logic in constant propagation (#31840)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31840

The next PR in this stack makes tuples insertable as constants, so we can remove special handling of tuples in constant propagation.

Test Plan: Imported from OSS

Differential Revision: D19439515

Pulled By: eellison

fbshipit-source-id: c58f153157f1d4eee4c1242decc4f36e41c1aa05

implement tuple constants (#31841)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31841

Add tuple constants to the JIT. The constraint here is that all elements of a tuple must themselves be insertable as a constant. Previously tuples were special-cased in constant propagation, but now that more passes insert constants, such as freezing, we should just have tuples be representable as constants.
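
A small TorchScript illustration (my own sketch, not taken from the PR):

```python
import torch

@torch.jit.script
def f():
    # every element is itself insertable as a constant, so the whole tuple
    # can now be represented as a single constant in the graph
    return (1, 2.0, 'three')

print(f.graph)
```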

Test Plan: Imported from OSS

Differential Revision: D19439514

Pulled By: eellison

fbshipit-source-id: 3810ba08ee349fa5598f4b53ea64525996637b1a

Adding QConfigTypePtrMap (#32203)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32203

The type is needed to allow multiple qconfig configurations for a shared
ClassType; see the next PR for more details.

Test Plan:
.

Imported from OSS

Differential Revision: D19508027

fbshipit-source-id: a3df29dab3038bfa88c55dda98a3e8a78e99e5a1

Remove mis-exposed abort API on ProcessGroup

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32292

Test Plan: Imported from OSS

Differential Revision: D19430252

Pulled By: mrshenli

fbshipit-source-id: 4ec594e1be54afe774bdcecc0f1c9bda2edf5e0d

Corrected logical boolean expression (#32249)

Summary:
Changed bitwise & to logical && in the boolean expression.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32249

Differential Revision: D19501586

Pulled By: eellison

fbshipit-source-id: afe374cfc9661182703cc82810d9cb735fbb8180

[caffe2] remove unnecessary np.set_printoptions and fix test errors (#32475)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32475

As title

Test Plan: CI

Reviewed By: houseroad

Differential Revision: D19508778

fbshipit-source-id: fd9ad63607535980505d155f3e3c3b7c6b95daf7

Updating submodules

Summary:
GitHub commits:

https://github.com/facebook/fbthrift/commit/87b81e7cb2e17d6cb2289d678decd9311136ab28
https://github.com/facebook/folly/commit/3a9a0976f2537ed66a465bf30ec2038a7a92d636
https://github.com/facebook/litho/commit/9294f3b2faeded509b6fb0c2780b4bf4d4e6d763
https://github.com/facebook/proxygen/commit/c8addc5ad4ebf73a2dbb8a00e0d9e68dfdf12cd7
https://github.com/facebookincubator/profilo/commit/9a9f1a849a33248fa4d7f06a100cfa73257de233
https://github.com/pytorch/fbgemm/commit/27cb280170fbf530033c4d0123e063e2f8bb50f3

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 73beec64bf9c17fa6c42dd09ea85350e8c9c66ea

[jit] Enable IValue to hold a PyObject (#32491)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32491

This PR enables IValue to hold a pure PyObject by adding a
new enum tag and a new jit_type to denote PyObject existence in IValue and
the JIT type system. We don't, and don't plan to, expose this to users.

This is the basic piece that enables IValue to be adopted more broadly, like
making RRef always hold an IValue; it might also simplify some compiler
logic.
ghstack-source-id: 97039980

Test Plan: Imported from OSS

Differential Revision: D19502234

fbshipit-source-id: 90be001706d707d376cfbea25980fd82980df84a

Fix race condition for to() backward that spans devices (#31930)

Summary:
While putting finishing touches on the gradient scaling PR (https://github.com/pytorch/pytorch/pull/26512), I discovered my multi-GPU test (which uses `to()` to transfer tensors between devices) was intermittently failing with bad numerics.  I knew it was going to be [a weird case from the start](https://www.imdb.com/title/tt8946378/quotes/qt4868203) and spent a week descending into madness.  It turns out, for backward ops that create gradients on a different device from the device on whose stream the op is executed, the streaming backward synchronizations in [input_buffer.cpp](https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/input_buffer.cpp#L46-L83) do not properly tell later ops to wait on the population/creation of those gradients.  For example, a cross-device `to()` backward (CopyBackward Node) enqueues a cudaMemcpyAsync on the current stream of the source (incoming gradient's) device, then [syncs getCurrentCUDAStream on the destination device with the cudaMemcpyAsync](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/Copy.cu#L76).  However, `input_buffer.cpp` in such cases ([case (3)](https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/input_buffer.cpp#L77-L81)) was not properly telling `opt_consumer_stream` to wait on the current stream of the destination device (`var`'s device).

Circumstances needed to repro in current master (see [my test](https://github.com/pytorch/pytorch/compare/master...mcarilli:backward_to_race_fix#diff-e68a7bc6ba14f212e5e7eb3727394b40R1901)):
- 2 devices, with non-default streams used for forward-pass ops on both devices (which is the default behavior in test_cuda.py)
- A `to()` that transfers a tensor requiring grad from one device to another
- A backward pass that routes back through to()'s backward (aka CopyBackward).

Under these circumstances, backward ops following CopyBackward on CopyBackward's destination device (aka the original forward-pass source device) race with the device-to-device transfer, and execute using partially-transferred data.
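
A minimal repro sketch under those circumstances (assumes two CUDA devices; heavily simplified from the linked test):

```python
import torch

# forward ops run on non-default streams on both devices, as test_cuda.py does
s0 = torch.cuda.Stream(device=0)
s1 = torch.cuda.Stream(device=1)
with torch.cuda.stream(s0), torch.cuda.stream(s1):
    a = torch.randn(1024, device='cuda:0', requires_grad=True)
    b = a.to('cuda:1')       # its backward (CopyBackward) spans devices
    loss = (b * 2.0).sum()
loss.backward()               # ops after CopyBackward on cuda:0 must wait for the copy
torch.cuda.synchronize()
print(a.grad[:4])
```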

The present PR fixes the race condition and ensures that later ops wait on the CopyBackward transfer.  This PR should also make streaming backward safe for other backward ops that span devices, as long as they play nice and populate any new gradients they create using the "current stream" of the device(s) on which they create those gradients.

There are a couple minor issues where I'm not sure of the best approach:
- Should we guard onto the var's device for the entire body of InputBuffer::add?
- I'm fairly sure we need to `recordStream` on `var` if the consumer stream is different from the stream on which (we expect) `var` was created, but calling `c10::cuda::CUDACachingAllocator::recordStream` in input_buffer.cpp might break CPU-only builds.  I couldn't find a different API call to record streams that seemed CPU-build-agnostic.  Could I wrap the call with a macro?

Thanks to mruberry for helpful suggestions and also the organization/naming of the stream pool and streaming backward code that allowed me to (just barely) wrap my head around the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31930

Differential Revision: D19517617

Pulled By: mruberry

fbshipit-source-id: 183d5460aefa5d27366b465b0473b80ec80fa044

[Rowwise Pruning][c2 op] Add Quantile Op (#32448)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32448

Using binary search to compute the value for the given quantile among the input tensors.
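
A rough sketch of the idea (not the actual Caffe2 operator): binary-search over the value range for the smallest value whose cumulative fraction of elements reaches the requested quantile.

```python
import torch

def quantile_by_bisection(values, q, iters=64):
    lo, hi = values.min().item(), values.max().item()
    n = values.numel()
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (values <= mid).sum().item() / n >= q:
            hi = mid   # mid already covers at least a q fraction of the elements
        else:
            lo = mid
    return hi

print(quantile_by_bisection(torch.randn(10000), 0.95))
```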

Test Plan: Newly added unittests;

Reviewed By: jspark1105

Differential Revision: D19487604

fbshipit-source-id: 0dc6627b78d1310ac35b3f1d53b89cc89a697ece

[caffe2] use 2-stage EmbeddingSpMDM interface (#32271)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32271

Use the 2-stage EmbeddingSpMDM interface in D19425982 to reduce the overhead of code cache lookup and lock contention.
Fix an issue in sparse_lengths_sum_benchmarks where empty indices were generated when the average length is small (e.g. 1).

Test Plan: CI

Reviewed By: dskhudia

Differential Revision: D19425987

fbshipit-source-id: d5c5f0d46e0072403901809c31d516fa0f4b9b31

Move pytorch distributed tests to separate folder for contbuild. (#30445)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30445

Create distributed and rpc directories under caffe/test for better management
of unit tests.

Differential Revision: D18702786

fbshipit-source-id: e9daeed0cfb846ef68806f6decfcb57c0e0e3606

[gloo] Skip registry warning (#31126)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31126

The Gloo device creator registry throws a warning that confuses users - https://fb.workplace.com/groups/1405155842844877/permalink/3217491788277931/
Create a C10_DEFINE_SHARED_REGISTRY_WITHOUT_WARNING API to skip such warnings.

Test Plan:
{F224342749}

Tested both `C10_DEFINE_SHARED_REGISTRY` and `C10_DEFINE_SHARED_REGISTRY_WITHOUT_WARNING`.
Make sure nothing breaks

Reviewed By: d4l3k

Differential Revision: D18904783

fbshipit-source-id: 0e0065d530956249a18325d4ed3cb58dec255d4c

Raise error for code that risks deadlock (#32295)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32295

Fix for https://github.com/pytorch/pytorch/issues/32045

Calling into the engine with the GIL can deadlock because:
- worker thread initialization acquires the GIL
- Any Node / hook can be a python function that will acquire the GIL

The choice was made here to raise an error, as one of the advantages of using cpp extensions with python is being able to release the GIL. So we prefer to educate users to do it themselves rather than doing it under the hood.

Test Plan: Imported from OSS

Differential Revision: D19430979

Pulled By: albanD

fbshipit-source-id: e43f57631885f12e573da0fc569c03a943cec519

[PyTorch BC] Clean up the whitelist for PyTorch Op BC check (#32523)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32523

remove stale items

Test Plan: cont build

Reviewed By: hl475

Differential Revision: D19526918

fbshipit-source-id: ee7392ae84e5ddf88284020775119e59c9b6533e

[quant][graphmode] Default to non-inplace in graph mode quantization API (#32204)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32204

att

Test Plan:
.

Imported from OSS

Differential Revision: D19508030

fbshipit-source-id: 94814c3c126a196f3938f944abfa5ae2a24d8dde

Fix nll_loss to support empty tensors on GPU (#31491)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31491

Fixes #31472

Test Plan: Imported from OSS

Differential Revision: D19537231

Pulled By: pbelevich

fbshipit-source-id: 20a43251a0f68a7a3557dd8234daee2d4814e5dd

Add unit test on export_opnames with interface. (#31531)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31531

As suggested by suo, add a unit test on torch.jit.export_opnames with an interface. A submodule is annotated as an interface and assigned to an instance, and then re-assigned to another instance. Make sure the operator names are also updated.

Test Plan: Imported from OSS

Differential Revision: D19539129

Pulled By: iseeyuan

fbshipit-source-id: 71a76ae7790cdd577618ca278afdb132727f08dc

Support 3D attention mask in MultiheadAttention. (#31996)

Summary:
Support a 3D attention mask for MultiheadAttention. If `attn_mask` has the batch dimension, it will not be unsqueezed. Fix https://github.com/pytorch/pytorch/issues/30678
Relevant issues/pr:
https://github.com/pytorch/pytorch/pull/25359
https://github.com/pytorch/pytorch/issues/29520
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31996
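
A usage sketch under my reading of the change; the 3D mask shape is assumed to be (batch_size * num_heads, target_len, source_len), so check the docs for the exact contract:

```python
import torch
import torch.nn as nn

embed_dim, num_heads, L, S, N = 16, 4, 5, 7, 2
mha = nn.MultiheadAttention(embed_dim, num_heads)
q = torch.randn(L, N, embed_dim)
k = v = torch.randn(S, N, embed_dim)

# additive float mask with a batch dimension; it is no longer unsqueezed
attn_mask = torch.zeros(N * num_heads, L, S)
out, weights = mha(q, k, v, attn_mask=attn_mask)
print(out.shape)   # torch.Size([5, 2, 16])
```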

Differential Revision: D19332816

Pulled By: zhangguanheng66

fbshipit-source-id: 3448af4b219607af60e02655affe59997ad212d9

[JIT] throw if no self arg on ignored methods (#32503)

Summary:
There was a user who did this and it would seg fault.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32503

Differential Revision: D19538481

Pulled By: eellison

fbshipit-source-id: dc3752028b9eff6ac88c025e8a2b5f8fd44ce32f

[quant][graphmode] Support quantizing shared ClassType with different qconfigs (#32205)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32205

to be filled

Test Plan:
python test_jit.py

Imported from OSS

Differential Revision: D19508031

fbshipit-source-id: cbf03d34e52eae62595c34fde6ec645cb6744ad9

no more build_pytorch_libs.sh/.bat (#32319)

Summary:
https://github.com/pytorch/pytorch/issues/12918
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32319

Differential Revision: D19544272

Pulled By: soumith

fbshipit-source-id: dd32fa61efa78af908f21c7e54cb6484bf895e54

Only run test_conv_large and test_conv_transposed_large_cuda on 32GB device (#32473)

Summary:
For some reason, these two tests start to fail on 16GB Volta on Linux...

Also fixes https://github.com/pytorch/pytorch/issues/31650
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32473

Differential Revision: D19538314

Pulled By: ngimel

fbshipit-source-id: 266195f19d8cf76b035795e0e318c152ae72adc2

[JIT] Passing custom class as arg (#32260)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32260

This makes it so you can actually pass the custom class as an arg to ScriptFunctions

Test Plan: Imported from OSS

Differential Revision: D19424252

Pulled By: jamesr66a

fbshipit-source-id: c3530186619655781dedbea03c2ad321aaff1cb8

[JIT] Test __getstate__ and __setstate__ for custom bound C++ classes

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32470

Test Plan: Imported from OSS

Differential Revision: D19508250

Pulled By: jamesr66a

fbshipit-source-id: 481299fb3c18fa874c2a1d2993984bb6b3193bac

[JIT] Fix custom class method binding for const methods

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32471

Test Plan: Imported from OSS

Differential Revision: D19508249

Pulled By: jamesr66a

fbshipit-source-id: 3a0bce6845072bb03567049a73b9982b54d8daf9

[JIT] Support returning tuple from custom bound C++ method

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32477

Test Plan: Imported from OSS

Differential Revision: D19509927

Pulled By: jamesr66a

fbshipit-source-id: 7d407150402cc19344c3ec3b4a27b3d7c464e8ac

[JIT] Add torch.classes.load_library

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32508

Test Plan: Imported from OSS

Differential Revision: D19525175

Pulled By: jamesr66a

fbshipit-source-id: b9f07113f551bdfb56d49d24d12989be2b8fc7e4

Revert "Remove __torch__ from custom class qualname" (#32514)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32514

This reverts commit c7fdf5b251c6fecd5d78b4f33d30bd77ca3f841c.

Test Plan: Imported from OSS

Differential Revision: D19525532

Pulled By: jamesr66a

fbshipit-source-id: 126f4e87250a2ac739bd7aa161a0f7b39f143d38

[quant] Re-enable test_nested that has different qconfig for shared ClassType (#32206)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32206

att

Test Plan:
python test/test_quantization.py

Imported from OSS

Differential Revision: D19508028

fbshipit-source-id: 5de3c2ef17de146feca03d7135a7e04f393de398

porting gather to ATen using TensorIterator with multithreading support. (#32425)

Summary:
Fixes [https://github.com/pytorch/pytorch/issues/24702](https://github.com/pytorch/pytorch/issues/24702).
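
A quick usage example of the op being ported (illustrative only):

```python
import torch

src = torch.tensor([[1, 2], [3, 4]])
index = torch.tensor([[0, 0], [1, 0]])
# out[i][j] = src[i][index[i][j]] when gathering along dim=1
print(torch.gather(src, 1, index))   # tensor([[1, 1], [4, 3]])
```
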
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32425

Differential Revision: D19538265

Pulled By: ngimel

fbshipit-source-id: 78821a16b6948916e956a04f984e0956f86cf582

[JIT] Remove capsule type handling of node hashing (#32540)

Summary:
Capsule Type doesn't appear in the IR; it is purely used in the runtime, so we should not have to handle it in node hashing... Let's see if this breaks anything.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32540

Differential Revision: D19541357

Pulled By: eellison

fbshipit-source-id: 905ed9f89cf6d03b45ddb4fde02adfa149b477f8

Updating submodules

Summary:
GitHub commits:

https://github.com/facebook/fbthrift/commit/08e28edc08dea3b96bc5eab84c10efecee580133
https://github.com/facebook/folly/commit/6884ecfc6724b30f3f54899889f309f81650e125
https://github.com/facebook/mcrouter/commit/685144514fc59139189b75f7a1c3387a992670e2
https://github.com/pytorch/fbgemm/commit/ed665880aa9b017b04af40193a22bcc933ddabad

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 7b19dca06ad7e8751de21efc48f5eada37b446fb

[rpc] Remove template on RRef and add Type to RRef creation (#30630)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30630

This removes the template and all the specializations it had in rpc; we
universally use IValue as the inner value, since we support holding a python
object inside an IValue.

This will also ensure that we have the correct type information when
creating the RRef: we use the return type from the schema when creating
UserRRef and OwnerRRef, which will enable the IValue to always have the correct
type if the IValue is an RRef object (next PR).

Test Plan: Imported from OSS

Differential Revision: D19502235

fbshipit-source-id: 0d5decae8a9767e0893f3b8b6456b231653be3c5

[pytorch][embeddingbag] Parallelize the EmbeddingBag operator (#…
```python
here = os.path.abspath(__file__)
lib_path = os.path.join(os.path.dirname(here), 'lib', lib_name)

ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
```
Contributor:

I've been going through this out of curiosity and it got me wondering if this doesn't lead to an eventual dlclose? Don't we have to stash this library handle somewhere?

```
@@ -54,3 +54,4 @@ def get_source_lines_and_file(obj):

TEST_MASTER_ADDR = '127.0.0.1'
TEST_MASTER_PORT = 29500
USE_RTLD_GLOBAL_WITH_LIBTORCH = False
```
Contributor:

Why do we have this constant if it's always false? Is this so that you can patch it in fbcode?

Contributor Author:

Yep, fbcode shenanigans

ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020
Summary:
Fixes pytorch#31181 and pytorch#31162 (comment).
Pull Request resolved: pytorch#32215

Differential Revision: D19501869

Pulled By: ezyang

fbshipit-source-id: 363824e52d2592ad968ecf1df345aa4c0daff915