Support composite of lowered sub modules of the same backend #59921

iseeyuan · 2021-06-13T17:42:55Z

Stack from ghstack:

- #59921 [Delegate] Support composite of lowered sub modules of the same backend

For a given backend, the lowered modules have a fixed name: `"torch.jit." + backend_name + "_LoweredModule"`. This is a problem in composite situations, where two different submodules are both lowered to the same backend: their lowered module names are identical.

This causes a bug in bytecode serialization, where module names are not mangled (a follow-up PR could mangle the names to make them unique). As a result, the `__setstate__` method is serialized for only one submodule because of the name conflict. When the other modules are loaded, the corresponding `__setstate__` cannot be found and run. The submodule is loaded as an ordinary nn module with its properties in a dictionary, which causes a crash with the error message "Expected GenericDict but got Tuple".

This PR fixes it by adding the submodule's original (unique) qualified name to the lowered module name. This also helps human understanding and debugging.
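The name collision described here can be sketched in a few lines of C++. This is a minimal illustration under stated assumptions, not the PyTorch implementation: `LoweredEntry`, `collectSetstate`, and the example module names are hypothetical, and the toy table that keeps one `__setstate__` per class name only stands in for the serializer behavior described in this PR. The old name follows the expression quoted in the description; the new name follows the changed line in `torch/csrc/jit/backends/backend_detail.cpp`.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a lowered module that is about to be serialized:
// its class name plus a placeholder for its __setstate__ body.
struct LoweredEntry {
  std::string class_name;
  std::string setstate_body;
};

// Old scheme: the name depends only on the backend.
std::string oldLoweredName(const std::string& backend_name) {
  return "torch.jit." + backend_name + "_LoweredModule";
}

// New scheme: the original module name is part of the lowered name.
std::string newLoweredName(const std::string& backend_name,
                           const std::string& module_name) {
  return "torch.jit." + backend_name + "_" + module_name + "_LoweredModule";
}

// Toy "serializer" (an assumption for illustration): it stores one
// __setstate__ per class name, so a second entry with the same name is
// silently dropped.
std::map<std::string, std::string> collectSetstate(
    const std::vector<LoweredEntry>& entries) {
  std::map<std::string, std::string> table;
  for (const auto& entry : entries) {
    table.emplace(entry.class_name, entry.setstate_body);  // keeps the first
  }
  return table;
}
```

With two entries for hypothetical modules `__torch__.SubA` and `__torch__.SubB` lowered to the same backend, `collectSetstate` keeps a single row when the names come from `oldLoweredName` and two rows when they come from `newLoweredName`.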

Test:
- Added unit test `BackendTest.TestComposite`
- CI

Differential Revision: D29091143

[ghstack-poisoned]

facebook-github-bot · 2021-06-13T17:43:01Z

💊 CI failures summary and remediations

As of commit 3e9657c (more details on the Dr. CI page and at hud.pytorch.org/pr/59921):

2/3 failures possibly* introduced in this PR
- 1/2 non-scanned failure(s)
1/3 tentatively recognized as flaky ❄️

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

pytorch_linux_xenial_py3_clang5_asan_test2 (1/1)

Step: "Run tests"

Jun 25 04:25:21 RuntimeError: test_unary_ufuncs failed!

Jun 25 04:25:21     #172 0x56018d032196 in main /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Programs/python.c:69
Jun 25 04:25:21     #173 0x7fe2a1f6c83f in __libc_start_main /build/glibc-S7Ft5T/glibc-2.23/csu/../csu/libc-start.c:291
Jun 25 04:25:21     #174 0x56018d0c233d in _start (/opt/conda/bin/python3.6+0x1a733d)
Jun 25 04:25:21 
Jun 25 04:25:21 SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/native/Math.h:217:17 in 
Jun 25 04:25:21 Traceback (most recent call last):
Jun 25 04:25:21   File "test/run_test.py", line 1310, in <module>
Jun 25 04:25:21     main()
Jun 25 04:25:21   File "test/run_test.py", line 1289, in main
Jun 25 04:25:21     raise RuntimeError(err_message)
Jun 25 04:25:21 RuntimeError: test_unary_ufuncs failed!
Jun 25 04:25:22 + cleanup
Jun 25 04:25:22 + retcode=1
Jun 25 04:25:22 + set +x
Jun 25 04:25:22 =================== sccache compilation log ===================
Jun 25 04:25:22 =========== If your build fails, please take a look at the log above for possible reasons ===========
Jun 25 04:25:22 Compile requests                      2
Jun 25 04:25:22 Compile requests executed             0
Jun 25 04:25:22 Cache hits                            0
Jun 25 04:25:22 Cache misses                          0
Jun 25 04:25:22 Cache timeouts                        0

❄️ 1 failure tentatively classified as flaky

but reruns have not yet been triggered to confirm:

pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_test2 (1/1)

Step: "Run tests" ❄️

Jun 25 05:08:24 unknown file: Failure

Jun 25 05:08:24 frame #7: build/bin/test_api() [0xc0b4d5]
Jun 25 05:08:24 frame #8: build/bin/test_api() [0xc0b775]
Jun 25 05:08:24 frame #9: testing::internal::UnitTestImpl::RunAllTests() + 0xbf9 (0xc0c7b9 in build/bin/test_api)
Jun 25 05:08:24 frame #10: testing::UnitTest::Run() + 0x8f (0xc0ca5f in build/bin/test_api)
Jun 25 05:08:24 frame #11: main + 0xc8 (0x5833a8 in build/bin/test_api)
Jun 25 05:08:24 frame #12: __libc_start_main + 0xf0 (0x7f52e642e840 in /lib/x86_64-linux-gnu/libc.so.6)
Jun 25 05:08:24 frame #13: _start + 0x29 (0x5b9a19 in build/bin/test_api)
Jun 25 05:08:24 " thrown in the test body.
Jun 25 05:08:24 [  FAILED  ] IntegrationTest.MNIST_CUDA (4 ms)
Jun 25 05:08:24 [ RUN      ] IntegrationTest.MNISTBatchNorm_CUDA
Jun 25 05:08:24 unknown file: Failure
Jun 25 05:08:24 C++ exception with description "Error opening images file at test/cpp/api/mnist/train-images-idx3-ubyte
Jun 25 05:08:24 Exception raised from read_images at /var/lib/jenkins/workspace/torch/csrc/api/src/data/datasets/mnist.cpp:67 (most recent call first):
Jun 25 05:08:24 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7f52ffc998cb in /var/lib/jenkins/workspace/build/lib/libc10.so)
Jun 25 05:08:24 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xce (0x7f52ffc950de in /var/lib/jenkins/workspace/build/lib/libc10.so)
Jun 25 05:08:24 frame #2: <unknown function> + 0x4223302 (0x7f5304308302 in /var/lib/jenkins/workspace/build/lib/libtorch_cpu.so)
Jun 25 05:08:24 frame #3: torch::data::datasets::MNIST::MNIST(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::data::datasets::MNIST::Mode) + 0x46 (0x7f53043093a6 in /var/lib/jenkins/workspace/build/lib/libtorch_cpu.so)
Jun 25 05:08:24 frame #4: IntegrationTest_MNISTBatchNorm_CUDA_Test::TestBody() + 0x9d6 (0x783fc6 in build/bin/test_api)
Jun 25 05:08:24 frame #5: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 0x4a (0xc144aa in build/bin/test_api)
Jun 25 05:08:24 frame #6: build/bin/test_api() [0xc0aee6]
Jun 25 05:08:24 frame #7: build/bin/test_api() [0xc0b4d5]


ghstack-source-id: e42a234 Pull Request resolved: #59921

iseeyuan · 2021-06-13T18:48:58Z

@iseeyuan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

kimishpatel · 2021-06-14T16:24:42Z

torch/csrc/jit/backends/backend_detail.cpp

  // Generate LoweredModule.
  Module loweredModule(
-      "torch.jit." + backend_name + "LoweredModule",
+      "torch.jit." + backend_name + "_" + module_name + "_LoweredModule",


So are we sure that two instances of the same module type will not get lowered separately? What happens if that is the case? We will run into the same issue, no? I think we should probably use some unique id as well.

Yeah, I think we also discussed this back then (the need for some unique ID).

I think the qualified name of a module is unique. @suo and @SplitInfinity, could you confirm?

I don't think that's the case. You can instantiate one module type with many instances.

I think what @kimishpatel means is that you can have a parent module with 2 submodules of the same type.
So the submodule will indeed have a unique name, but then it will show up twice in the parent module.

If that causes a problem, then we need to mangle in another unique ID; otherwise, what you did is enough.
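The case raised here can be made concrete with a small sketch. It assumes that the `module_name` used in the lowered name is the qualified name of the submodule's class type; `SubmoduleInfo`, `loweredNameFromType`, and the example names are hypothetical. Under that assumption, two instances of the same type receive the same lowered name, while instances of different types do not.

```cpp
#include <cassert>
#include <string>

// Hypothetical description of a submodule inside a parent module.
struct SubmoduleInfo {
  std::string attribute_name;  // name in the parent, unique per instance
  std::string type_qualname;   // qualified class name, shared across instances
};

// Builds the lowered name from the type's qualified name only
// (an assumption about what module_name holds).
std::string loweredNameFromType(const std::string& backend_name,
                                const SubmoduleInfo& sub) {
  return "torch.jit." + backend_name + "_" + sub.type_qualname +
         "_LoweredModule";
}
```

For example, submodules `sub1` and `sub2` that are both of type `__torch__.Sub` map to one name, so any per-instance disambiguation would have to come from an extra identifier.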

Let me quickly confirm "a parent module with 2 submodules of the same type" case and see if they have the same qualified name. Thanks @raziel and @kimishpatel !

After diving deeper, there are actually two issues:

1. The delegation issue that this PR resolves. Basically, a new Module is constructed here, with a class name as the first argument of the constructor. This class name should reflect the original class name. After this fix, different classes lowered to the same backend also have different class names.

2. A general issue in bytecode serialization. For the same original class name but different instances (@kimishpatel and @raziel raised here). It is handled in TorchScript serialization, by mangling the names in a TypeNameUniquer. However, it's not mangled when serializing bytecode. As a result, there is a discrepancy between the names in TS and in bytecode. This discrepancy appears only once, at the first serialization. When the model is loaded again, the mangled name is taken from bytecode. T93782563 was created to follow up on this issue.

So issue #1 needs to be resolved anyway, and I think this PR is still valid for that. The unique ID for an instance is a separate issue and can be addressed in a centralized place in bytecode serialization.

Thanks.
> For the same original class name but different instances. It is handled in TorchScript serialization, by mangling the names in a TypeNameUniquer. However, it's not mangled when serializing bytecode.

So this is a general issue in TS that potentially affects any custom class?

> So this is a general issue in TS that potentially affects any custom class?

I think so. Not only for custom classes but for all class types. TS has handled it using type_name_uniquer_ in ScriptModuleSerializer. I think bytecode needs to reflect the same pattern.
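To illustrate the kind of mangling being discussed, here is a toy name uniquer. It is a sketch only and is not PyTorch's `TypeNameUniquer`: the class `ToyNameUniquer`, its method, and the `_mangle_N` suffix format are all assumptions made for this example. The idea is that a serializer asks the uniquer for a name once per distinct class type, and any repeated base name gets a suffix so that every returned name is distinct.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>

// Toy uniquer (hypothetical): returns the base name the first time it is
// requested, and a suffixed variant on every later request.
class ToyNameUniquer {
 public:
  std::string getUniqueName(const std::string& base) {
    std::string candidate = base;
    std::size_t suffix = 0;
    // insert() reports false when the candidate is already taken, so keep
    // trying suffixed names until one is new.
    while (!used_.insert(candidate).second) {
      candidate = base + "_mangle_" + std::to_string(suffix++);
    }
    return candidate;
  }

 private:
  std::unordered_set<std::string> used_;  // every name handed out so far
};
```

If both the TorchScript path and the bytecode path took their names from one shared uniquer of this kind, they would agree on the mangled names, which is the consistency the comment above asks for.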

kimishpatel

Wondering if we need to add some unique id to it.

raziel

Thanks.
From what you said, the remaining issue (at the instance level) should be solved elsewhere in the code, and this fix (at the type/class level) still makes sense.


ghstack-source-id: 6311d23 Pull Request resolved: #59921

iseeyuan · 2021-06-21T21:09:37Z

@iseeyuan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


ghstack-source-id: ef5a6cb Pull Request resolved: #59921

iseeyuan · 2021-06-22T14:25:10Z

@iseeyuan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


ghstack-source-id: 4090240 Pull Request resolved: #59921

iseeyuan · 2021-06-24T04:42:41Z

@iseeyuan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


ghstack-source-id: 2d2163d Pull Request resolved: #59921


ghstack-source-id: c061445 Pull Request resolved: #59921

iseeyuan · 2021-06-25T02:36:06Z

@iseeyuan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2021-06-25T14:20:13Z

@iseeyuan merged this pull request in d8c3d55.

…nd (pytorch#59921) Summary: Pull Request resolved: pytorch#59921 Test Plan: Imported from OSS Reviewed By: raziel Differential Revision: D29091143 Pulled By: iseeyuan fbshipit-source-id: 9ffcd18681917ece8ec73a34866c53701bdee1bc


[Delegate] Support composite of lowered sub modules of the same backend

2cb99f6

[ghstack-poisoned]

facebook-github-bot added the oncall: jit and cla signed labels Jun 13, 2021

iseeyuan added a commit that referenced this pull request Jun 13, 2021

[Delegate] Support composite of lowered sub modules of the same backend

3023a45

ghstack-source-id: e42a234 Pull Request resolved: #59921

iseeyuan requested review from SplitInfinity, kimishpatel, raziel and suo June 14, 2021 16:01

kimishpatel reviewed Jun 14, 2021


kimishpatel requested changes Jun 14, 2021


iseeyuan requested a review from kimishpatel June 21, 2021 18:57

raziel approved these changes Jun 21, 2021


iseeyuan added a commit that referenced this pull request Jun 21, 2021

[Delegate] Support composite of lowered sub modules of the same backend

f8cc87d

ghstack-source-id: 6311d23 Pull Request resolved: #59921

iseeyuan added a commit that referenced this pull request Jun 22, 2021

[Delegate] Support composite of lowered sub modules of the same backend

9293bf6

ghstack-source-id: ef5a6cb Pull Request resolved: #59921

iseeyuan added a commit that referenced this pull request Jun 23, 2021

[Delegate] Support composite of lowered sub modules of the same backend

97a9f48

ghstack-source-id: 4090240 Pull Request resolved: #59921

iseeyuan added a commit that referenced this pull request Jun 24, 2021

[Delegate] Support composite of lowered sub modules of the same backend

ae98a3f

ghstack-source-id: 2d2163d Pull Request resolved: #59921

iseeyuan added a commit that referenced this pull request Jun 25, 2021

[Delegate] Support composite of lowered sub modules of the same backend

613a71a

ghstack-source-id: c061445 Pull Request resolved: #59921

facebook-github-bot closed this in d8c3d55 Jun 25, 2021

facebook-github-bot added the Merged label Jun 25, 2021

facebook-github-bot deleted the gh/iseeyuan/124/head branch June 29, 2021 14:22

tugsbayasgalan changed the title ~~[Delegate] Support composite of lowered sub modules of the same backend~~ Support composite of lowered sub modules of the same backend Oct 18, 2021
