
Port silu_backward to structured #58661

Closed · wants to merge 9 commits

Conversation

@Freey0 (Contributor) commented May 20, 2021

Stack from ghstack:

Differential Revision: D28572530

@facebook-github-bot commented May 20, 2021

💊 CI failures summary and remediations

As of commit e463470 (more details on the Dr. CI page and at hud.pytorch.org/pr/58661):


  • 4/4 failures possibly* introduced in this PR
    • 1/4 non-scanned failure(s)

🕵️ 3 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (1/3)

Step: "(Optional) Merge target branch"

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in aten/src/ATen/native/native_functions.yaml
Auto-merging aten/src/ATen/native/native_functions.yaml
CONFLICT (add/add): Merge conflict in aten/src/ATen/native/cuda/Activation.cu
Auto-merging aten/src/ATen/native/cuda/Activation.cu
CONFLICT (add/add): Merge conflict in aten/src/ATen/native/cpu/Activation.cpp
Auto-merging aten/src/ATen/native/cpu/Activation.cpp
CONFLICT (add/add): Merge conflict in aten/src/ATen/native/Activation.h
Auto-merging aten/src/ATen/native/Activation.h
CONFLICT (add/add): Merge conflict in aten/src/ATen/native/Activation.cpp
Auto-merging aten/src/ATen/native/Activation.cpp
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (2/3)

Step: "(Optional) Merge target branch"

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in aten/src/ATen/native/native_functions.yaml
Auto-merging aten/src/ATen/native/native_functions.yaml
CONFLICT (add/add): Merge conflict in aten/src/ATen/native/cuda/Activation.cu
Auto-merging aten/src/ATen/native/cuda/Activation.cu
CONFLICT (add/add): Merge conflict in aten/src/ATen/native/cpu/Activation.cpp
Auto-merging aten/src/ATen/native/cpu/Activation.cpp
CONFLICT (add/add): Merge conflict in aten/src/ATen/native/Activation.h
Auto-merging aten/src/ATen/native/Activation.h
CONFLICT (add/add): Merge conflict in aten/src/ATen/native/Activation.cpp
Auto-merging aten/src/ATen/native/Activation.cpp
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See CircleCI build pytorch_linux_bionic_py3_8_gcc9_coverage_test2 (3/3)

Step: "Run tests"

Jun 25 21:53:07 AssertionError: False is not tr...lowed difference with rtol=0 and atol=0 is only 0!
Jun 25 21:53:07 ----------------------------------------------------------------------
Jun 25 21:53:07 Traceback (most recent call last):
Jun 25 21:53:07   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 399, in wrapper
Jun 25 21:53:07     self._join_processes(fn)
Jun 25 21:53:07   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 606, in _join_processes
Jun 25 21:53:07     self._check_return_codes(elapsed_time)
Jun 25 21:53:07   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 655, in _check_return_codes
Jun 25 21:53:07     self.assertEqual(
Jun 25 21:53:07   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1498, in assertEqual
Jun 25 21:53:07     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
Jun 25 21:53:07 AssertionError: False is not true : Scalars failed to compare as equal! Comparing -6 and 0 gives a difference of 6, but the allowed difference with rtol=0 and atol=0 is only 0!
Jun 25 21:53:07 Expect process 1 exit code to match Process 0 exit code of 0, but got -6
Jun 25 21:53:07 
Jun 25 21:53:08 ----------------------------------------------------------------------
Jun 25 21:53:08 Ran 367 tests in 1387.774s
Jun 25 21:53:08 
Jun 25 21:53:08 FAILED (failures=1, skipped=7)
Jun 25 21:53:08 
Jun 25 21:53:08 Generating XML reports...
Jun 25 21:53:08 Generated XML report: test-reports/python-unittest/distributed.rpc.test_tensorpipe_agent/TEST-TensorPipeDdpComparisonTestWithSpawn-20210625213000.xml
Jun 25 21:53:08 Generated XML report: test-reports/python-unittest/distributed.rpc.test_tensorpipe_agent/TEST-TensorPipeDdpUnderDistAutogradTestWithSpawn-20210625213000.xml

This comment was automatically generated by Dr. CI.

@ezyang (Contributor) commented May 20, 2021

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


I removed `dispatch: CompositeImplicitAutograd: math_silu_backward`.
This is definitely not right, but I don't know how it works with structured kernels.
Keeping it triggers an assertion failure:

```
assert dispatch.keys() != {DispatchKey.CompositeImplicitAutograd}, \
    f"unexpected name for singleton CompositeImplicitAutograd dispatch entry: expected {cpp.name(func)} " \
    f"but got {dispatch[DispatchKey.CompositeImplicitAutograd]}.  Rename your implementation to the expected " \
    "name, then delete the dispatch table"
```

Differential Revision: [D28572530](https://our.internmc.facebook.com/intern/diff/D28572530)

[ghstack-poisoned]
Freey0 added a commit that referenced this pull request May 20, 2021
ghstack-source-id: 7b94c707ec5366d05219adf6e0756283c516ec9d
Pull Request resolved: #58661
Freey0 added a commit that referenced this pull request May 22, 2021
ghstack-source-id: 51c401833df5c1c18caa0f6b82808e1ccb97c76b
Pull Request resolved: #58661
@ezyang (Contributor) commented May 24, 2021

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Freey0 added a commit that referenced this pull request May 25, 2021
ghstack-source-id: 630aaef898ffb726316e14072a11a900dc73d526
Pull Request resolved: #58661
@Freey0 (Contributor, Author) commented May 25, 2021

@ezyang Do you know how structured functions work with CompositeImplicitAutograd? I looked through the code base and there don't seem to be any examples of the two working together at the moment. Any suggestions are greatly appreciated!

@ezyang (Contributor) commented May 25, 2021

They don't; see #50953. The workaround is to NOT use structured, and just redispatch as necessary.
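
For illustration only: a composite implementation of this kind is just a plain function written in terms of other dispatched ATen ops, so autograd and the individual backends fall out of the redispatch automatically. The sketch below is hedged: the function name `composite_silu_backward_sketch` is made up, and only the standard SiLU derivative, sigmoid(x) * (1 + x * (1 - sigmoid(x))), is assumed; it is not PyTorch's actual `math_silu_backward`.

```cpp
// A minimal sketch (hypothetical name, not PyTorch's actual code) of a
// composite, non-structured backward: everything is expressed through other
// ATen ops, which is what a CompositeImplicitAutograd entry amounts to.
#include <ATen/ATen.h>

at::Tensor composite_silu_backward_sketch(
    const at::Tensor& grad_output,
    const at::Tensor& input) {
  // silu(x) = x * sigmoid(x), so
  // d/dx silu(x) = sigmoid(x) * (1 + x * (1 - sigmoid(x)))
  auto sig = at::sigmoid(input);
  return grad_output * sig * (1 + input * (1 - sig));
}
```

Because every op in the body goes back through the dispatcher, a function written this way needs no per-backend kernel and no explicit autograd formula.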

@Freey0 (Contributor, Author) commented May 27, 2021

> They don't; see #50953. The workaround is to NOT use structured, and just redispatch as necessary.

Sorry, I didn't make that clear. I know that CompositeImplicitAutograd and structured kernels currently don't work together.

But I don't know how to handle the following error:

```
May 25 12:42:30   File "/var/lib/jenkins/workspace/tools/codegen/model.py", line 342, in from_yaml
May 25 12:42:30     f"unexpected name for singleton CompositeImplicitAutograd dispatch entry: expected {cpp.name(func)} " \
May 25 12:42:30 AssertionError: unexpected name for singleton CompositeImplicitAutograd dispatch entry: expected silu_backward but got math_silu_backward.  Rename your implementation to the expected name, then delete the dispatch table
May 25 12:42:30   in /var/lib/jenkins/workspace/cmake/../aten/src/ATen/native/native_functions.yaml:3555:
May 25 12:42:30     silu_backward(Tensor grad_output, Tensor self) -> Tensor
```

How do I preserve the original functionality of math_silu_backward?

Any suggestions are greatly appreciated!

@ezyang (Contributor) commented May 27, 2021

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang (Contributor) commented May 27, 2021

OHHH you're right, this is different, I think you might need a teeny bugfix for this one

@ezyang (Contributor) commented May 27, 2021
```diff
diff --git a/aten/src/ATen/test/math_kernel_test.cpp b/aten/src/ATen/test/math_kernel_test.cpp
index 005c11cb0ea..cee59d65cbc 100644
--- a/aten/src/ATen/test/math_kernel_test.cpp
+++ b/aten/src/ATen/test/math_kernel_test.cpp
@@ -1,6 +1,7 @@
 #include <gtest/gtest.h>
 
 #include <ATen/ATen.h>
+#include <ATen/CPUFunctions.h>
 
 using namespace at;
 
@@ -105,7 +106,7 @@ TEST(MathKernelTest, Addr) {
 TEST(MathKernelTest, SiluBackward) {
   const auto input = rand({20, 10});
   const auto grad_output = rand({20, 10});
-  auto out = at::native::silu_backward(grad_output, input);
+  auto out = at::cpu::silu_backward(grad_output, input);
   auto math_out = at::native::math_silu_backward(grad_output, input);
   ASSERT_ALLCLOSE_TOLERANCES(out, math_out, 1e-4, 1e-6);
 }
diff --git a/tools/codegen/model.py b/tools/codegen/model.py
index fba3756fab1..f1ff8a19d0f 100644
--- a/tools/codegen/model.py
+++ b/tools/codegen/model.py
@@ -338,7 +338,9 @@ class NativeFunction:
             assert dispatch != {DispatchKey.CompositeImplicitAutograd: cpp.name(func)}, \
                 "unnecessary dispatch table for this function; just delete the dispatch " \
                 "key entirely"
-            assert dispatch.keys() != {DispatchKey.CompositeImplicitAutograd}, \
+            # if a function is a structured delegate, deleting the dispatch
+            # table is NOT semantics preserving
+            assert structured_delegate or dispatch.keys() != {DispatchKey.CompositeImplicitAutograd}, \
                 f"unexpected name for singleton CompositeImplicitAutograd dispatch entry: expected {cpp.name(func)} " \
                 f"but got {dispatch[DispatchKey.CompositeImplicitAutograd]}.  Rename your implementation to the expected " \
                 "name, then delete the dispatch table"
```

But you also need #58569.

Freey0 added a commit that referenced this pull request Jun 20, 2021
ghstack-source-id: a30d9f07b401da481b515612a466e4344bbe2341
Pull Request resolved: #58661
@ezyang (Contributor) commented Jun 21, 2021

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

ezyang added a commit that referenced this pull request Jun 21, 2021
ghstack-source-id: 41417bfe769cbd5f58ed6cb7e76203cfb5d87660
Pull Request resolved: #58661
@ezyang (Contributor) commented Jun 21, 2021

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang (Contributor) commented Jun 23, 2021

It would seem the ASAN failures are real.

@ezyang (Contributor) commented Jun 23, 2021

successfully reproduced locally

@ezyang (Contributor) commented Jun 23, 2021

fix coming

ezyang added a commit that referenced this pull request Jun 23, 2021
#58661 induced a static
initialization order fiasco as flagged by ASAN strict_init_order=true.
On further inspection, it became clear that it was not necessary for
these to actually be globals initialized at module load time; so
I converted them into Meyer singletons which ensures they get loaded
immediately when another compilation unit requests them.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Jun 23, 2021
#58661 induced a static
initialization order fiasco as flagged by ASAN strict_init_order=true.
On further inspection, it became clear that it was not necessary for
these to actually be globals initialized at module load time; so
I converted them into Meyer singletons which ensures they get loaded
immediately when another compilation unit requests them.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

ghstack-source-id: abca5df9cfd71f769509c05c853611a2682e68e8
Pull Request resolved: #60568
@Freey0 (Contributor, Author) commented Jun 24, 2021

Thank you very much!

Can you share your debugging experience? Whenever I see this kind of error, I have no idea where to start. (Maybe because I am a novice :smile:)

@ezyang (Contributor) commented Jun 24, 2021

> Can you share your debugging experience? Whenever I see this kind of error, I have no idea where to start.

In this particular case, I spent a day getting a local ASAN build and confirming that it indeed failed. (Mostly by staring at https://github.com/pytorch/pytorch/blob/master/.jenkins/pytorch/build-asan.sh and https://github.com/pytorch/pytorch/blob/master/.jenkins/pytorch/test.sh ) With debug info, I could get line numbers for this error in CI:
```
Jun 22 01:03:01 ==79==ERROR: AddressSanitizer: initialization-order-fiasco on address 0x7f7400fcd760 at pc 0x7f73de8343b9 bp 0x7ffdda1bc820 sp 0x7ffdda1bc818
Jun 22 01:03:01 READ of size 8 at 0x7f7400fcd760 thread T0
Jun 22 01:03:03     #0 0x7f73de8343b8 in std::__shared_ptr<c10::OperatorKernel, (__gnu_cxx::_Lock_policy)2>::operator=(std::__shared_ptr<c10::OperatorKernel, (__gnu_cxx::_Lock_policy)2> const&) (/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so+0xdbcb3b8)
Jun 22 01:03:03     #1 0x7f73de82609e in c10::impl::OperatorEntry::updateDispatchTableEntry_(c10::Dispatcher const&, c10::DispatchKey) (/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so+0xdbbd09e)
Jun 22 01:03:03     #2 0x7f73de8244ea in c10::impl::OperatorEntry::updateDispatchTable_(c10::Dispatcher const&, c10::DispatchKey) (/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so+0xdbbb4ea)
Jun 22 01:03:03     #3 0x7f73de8232fd in c10::impl::OperatorEntry::registerKernel(c10::Dispatcher const&, c10::optional<c10::DispatchKey>, c10::KernelFunction, c10::optional<c10::impl::CppSignature>, std::unique_ptr<c10::FunctionSchema, std::default_delete<c10::FunctionSchema> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) (/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so+0xdbba2fd)
Jun 22 01:03:03     #4 0x7f73de80b961 in c10::Dispatcher::registerImpl(c10::OperatorName, c10::optional<c10::DispatchKey>, c10::KernelFunction, c10::optional<c10::impl::CppSignature>, std::unique_ptr<c10::FunctionSchema, std::default_delete<c10::FunctionSchema> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) (/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so+0xdba2961)
Jun 22 01:03:03     #5 0x7f73de7efb3b in torch::Library::_impl(char const*, torch::CppFunction&&) & (/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so+0xdb86b3b)
Jun 22 01:03:03     #6 0x7f73e10da140 in at::(anonymous namespace)::TORCH_LIBRARY_IMPL_init_aten_Meta_2(torch::Library&) (/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so+0x10471140)
```

First I had to remind myself what this error message meant, so I looked it up at https://github.com/google/sanitizers/wiki/AddressSanitizerInitializationOrderFiasco#strict-init-order-checking.

Then, with the line number in hand, I looked at the code and tried to figure out what static data updateDispatchTableEntry_ was using; I could see it was accessing an OperatorEntry static variable. I still wasn't sure why that was a problem, but I knew from previous experience that turning it into a Meyer singleton would work, so I did, and confirmed in my local ASAN build that it no longer triggered the SIOF bug.
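
To make that concrete, here is a minimal, generic sketch of the pattern (the type and function names are assumptions for illustration, not the actual PyTorch change): a namespace-scope global is constructed at some unspecified point during static initialization, so another translation unit's initializer can read it before its constructor has run; a Meyer singleton is a function-local static constructed on first use, so callers always see a fully built object.

```cpp
// Generic illustration of the Meyer singleton pattern; Registry and
// registry() are made-up names for this sketch.
#include <string>
#include <unordered_map>

struct Registry {
  std::unordered_map<std::string, int> entries;
};

// Problematic: the order in which this global is constructed relative to
// statics in other translation units is unspecified, so a static initializer
// elsewhere may touch it before it exists (the "initialization order fiasco").
// Registry g_registry;

// Meyer singleton: the local static is initialized the first time registry()
// is called (thread-safely since C++11), so it is always ready when another
// compilation unit asks for it.
Registry& registry() {
  static Registry instance;
  return instance;
}
```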

facebook-github-bot pushed a commit that referenced this pull request Jun 24, 2021
Summary:
Pull Request resolved: #60568

#58661 induced a static
initialization order fiasco as flagged by ASAN strict_init_order=true.
On further inspection, it became clear that it was not necessary for
these to actually be globals initialized at module load time; so
I converted them into Meyer singletons which ensures they get loaded
immediately when another compilation unit requests them.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D29338019

Pulled By: ezyang

fbshipit-source-id: 282846118df6867277404a1830d0ce39fccaa769
ezyang added a commit that referenced this pull request Jun 25, 2021
ghstack-source-id: 763e15f79779482887f9bfa2d2472b1714bf8bcc
Pull Request resolved: #58661
@ezyang (Contributor) commented Jun 25, 2021

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang merged this pull request in b52849b.
