@iseeyuan iseeyuan commented Jun 13, 2021

Stack from ghstack:

For a given backend, every lowered module has the fixed name `"torch.jit." + backend_name + "_LoweredModule"`. This is a problem in composite situations, where two different submodules are both lowered to the same backend: the lowered submodules end up with identical names.

This causes a bug in bytecode serialization, where module names are not mangled (a follow-up PR could mangle the names to make them unique). Because of the name conflict, the `__setstate__` method is serialized for only one submodule. When the other submodules are loaded, the corresponding `__setstate__` cannot be found and run. Such a submodule is loaded as an ordinary nn module with its properties in a dictionary, causing a crash with the error message "Expected GenericDict but got Tuple".

This PR fixes it by adding the submodule's original (unique) qualified name to the lowered module name. The longer name also helps with human understanding and debugging.
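
The collision and the fix can be illustrated with a small standalone sketch. It does not use PyTorch: `countSerializedSetstates`, the backend name `test_backend`, and the module names `SubA` and `SubB` are hypothetical, and the map only stands in for a serializer that keeps one `__setstate__` entry per class name.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Old scheme from the description: one fixed class name per backend.
std::string oldLoweredName(const std::string& backend_name) {
  return "torch.jit." + backend_name + "_LoweredModule";
}

// New scheme: the original module's name becomes part of the class name.
std::string newLoweredName(const std::string& backend_name,
                           const std::string& module_name) {
  return "torch.jit." + backend_name + "_" + module_name + "_LoweredModule";
}

// Hypothetical stand-in for a serializer that stores one __setstate__
// per class name. Returns how many distinct entries end up stored.
std::size_t countSerializedSetstates(
    const std::vector<std::string>& class_names) {
  std::map<std::string, std::string> setstate_by_class;
  for (const auto& name : class_names) {
    // emplace keeps the first entry and ignores later ones with the same key.
    setstate_by_class.emplace(name, "__setstate__ of " + name);
  }
  return setstate_by_class.size();
}
```

Under these assumptions, two submodules lowered to `test_backend` produce a single stored entry with the old naming and two entries with the new naming, which mirrors the missing `__setstate__` described above.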

Test:
Added unit test of BackendTest.TestComposite
CI

Differential Revision: D29091143

@facebook-github-bot added the oncall: jit and cla signed labels Jun 13, 2021

facebook-github-bot commented Jun 13, 2021

💊 CI failures summary and remediations

As of commit 3e9657c (more details on the Dr. CI page and at hud.pytorch.org/pr/59921):



🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_clang5_asan_test2 (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jun 25 04:25:21 RuntimeError: test_unary_ufuncs failed!
Jun 25 04:25:21     #172 0x56018d032196 in main /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Programs/python.c:69
Jun 25 04:25:21     #173 0x7fe2a1f6c83f in __libc_start_main /build/glibc-S7Ft5T/glibc-2.23/csu/../csu/libc-start.c:291
Jun 25 04:25:21     #174 0x56018d0c233d in _start (/opt/conda/bin/python3.6+0x1a733d)
Jun 25 04:25:21 
Jun 25 04:25:21 SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/native/Math.h:217:17 in 
Jun 25 04:25:21 Traceback (most recent call last):
Jun 25 04:25:21   File "test/run_test.py", line 1310, in <module>
Jun 25 04:25:21     main()
Jun 25 04:25:21   File "test/run_test.py", line 1289, in main
Jun 25 04:25:21     raise RuntimeError(err_message)
Jun 25 04:25:21 RuntimeError: test_unary_ufuncs failed!
Jun 25 04:25:22 + cleanup
Jun 25 04:25:22 + retcode=1
Jun 25 04:25:22 + set +x
Jun 25 04:25:22 =================== sccache compilation log ===================
Jun 25 04:25:22 =========== If your build fails, please take a look at the log above for possible reasons ===========
Jun 25 04:25:22 Compile requests                      2
Jun 25 04:25:22 Compile requests executed             0
Jun 25 04:25:22 Cache hits                            0
Jun 25 04:25:22 Cache misses                          0
Jun 25 04:25:22 Cache timeouts                        0

❄️ 1 failure tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_test2 (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

Jun 25 05:08:24 unknown file: Failure
Jun 25 05:08:24 frame #7: build/bin/test_api() [0xc0b4d5]
Jun 25 05:08:24 frame #8: build/bin/test_api() [0xc0b775]
Jun 25 05:08:24 frame #9: testing::internal::UnitTestImpl::RunAllTests() + 0xbf9 (0xc0c7b9 in build/bin/test_api)
Jun 25 05:08:24 frame #10: testing::UnitTest::Run() + 0x8f (0xc0ca5f in build/bin/test_api)
Jun 25 05:08:24 frame #11: main + 0xc8 (0x5833a8 in build/bin/test_api)
Jun 25 05:08:24 frame #12: __libc_start_main + 0xf0 (0x7f52e642e840 in /lib/x86_64-linux-gnu/libc.so.6)
Jun 25 05:08:24 frame #13: _start + 0x29 (0x5b9a19 in build/bin/test_api)
Jun 25 05:08:24 " thrown in the test body.
Jun 25 05:08:24 [  FAILED  ] IntegrationTest.MNIST_CUDA (4 ms)
Jun 25 05:08:24 [ RUN      ] IntegrationTest.MNISTBatchNorm_CUDA
Jun 25 05:08:24 unknown file: Failure
Jun 25 05:08:24 C++ exception with description "Error opening images file at test/cpp/api/mnist/train-images-idx3-ubyte
Jun 25 05:08:24 Exception raised from read_images at /var/lib/jenkins/workspace/torch/csrc/api/src/data/datasets/mnist.cpp:67 (most recent call first):
Jun 25 05:08:24 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7f52ffc998cb in /var/lib/jenkins/workspace/build/lib/libc10.so)
Jun 25 05:08:24 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xce (0x7f52ffc950de in /var/lib/jenkins/workspace/build/lib/libc10.so)
Jun 25 05:08:24 frame #2: <unknown function> + 0x4223302 (0x7f5304308302 in /var/lib/jenkins/workspace/build/lib/libtorch_cpu.so)
Jun 25 05:08:24 frame #3: torch::data::datasets::MNIST::MNIST(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::data::datasets::MNIST::Mode) + 0x46 (0x7f53043093a6 in /var/lib/jenkins/workspace/build/lib/libtorch_cpu.so)
Jun 25 05:08:24 frame #4: IntegrationTest_MNISTBatchNorm_CUDA_Test::TestBody() + 0x9d6 (0x783fc6 in build/bin/test_api)
Jun 25 05:08:24 frame #5: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 0x4a (0xc144aa in build/bin/test_api)
Jun 25 05:08:24 frame #6: build/bin/test_api() [0xc0aee6]
Jun 25 05:08:24 frame #7: build/bin/test_api() [0xc0b4d5]

This comment was automatically generated by Dr. CI.

iseeyuan added a commit that referenced this pull request Jun 13, 2021

@iseeyuan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

// Generate LoweredModule.
Module loweredModule(
-    "torch.jit." + backend_name + "LoweredModule",
+    "torch.jit." + backend_name + "_" + module_name + "_LoweredModule",
Contributor

So are we sure that two instances of the same module type will not get lowered separately? What happens if that is the case? We will run into the same issue, no? I think we should probably use some unique id as well.

Contributor

Yeah, I think we also discussed this back then (the need for some unique ID).

Contributor Author

I think the qualified name of a module is unique. @suo and @SplitInfinity , could you confirm?

Contributor

I don't think that's the case. You can instantiate one module type with many instances.

Contributor

I think what @kimishpatel means is that you can have a parent module with 2 submodules of the same type.
So the submodule type will indeed have a unique name, but it will show up twice in the parent module.

If that causes a problem, then we need to mangle in another unique ID; otherwise, what you did is enough.

Contributor Author

Let me quickly check the "a parent module with 2 submodules of the same type" case and see whether they have the same qualified name. Thanks @raziel and @kimishpatel!

Contributor Author

After digging deeper, there are actually two issues:

  1. The delegation issue that this PR resolves. A new Module is constructed here, with a class name as the first argument of the constructor. This class name should reflect the original class name. After this fix, different classes lowered to the same backend also get different class names.
  2. A general issue in bytecode serialization, for different instances with the same original class name (raised here by @kimishpatel and @raziel). TorchScript serialization handles this by mangling the names in a TypeNameUniquer, but the names are not mangled when serializing bytecode. As a result, the names in TS and in bytecode differ. This discrepancy appears only once, at the first serialization; when the model is loaded again, the mangled name would be taken from bytecode. T93782563 was created to follow up on this issue.

Issue 1 needs to be resolved in any case, so I think this PR is still valid for it. The unique ID for an instance is a separate issue that can be addressed in a centralized place in bytecode serialization.
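
The instance-level mangling mentioned in issue 2 could look roughly like the sketch below. This is not PyTorch's `TypeNameUniquer`: the `NameUniquer` class and the `_mangle_N` suffix are hypothetical and only illustrate handing out a distinct name when the requested one is already taken.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical uniquer, shown only to illustrate the general idea of
// name mangling. It is not the TorchScript implementation.
class NameUniquer {
 public:
  // Returns `name` if it has not been handed out yet. Otherwise appends a
  // counter-based suffix and retries until the result is unused.
  std::string getUniqueName(const std::string& name) {
    std::string candidate = name;
    while (!used_.insert(candidate).second) {
      candidate = name + "_mangle_" + std::to_string(next_id_++);
    }
    return candidate;
  }

 private:
  std::unordered_set<std::string> used_;  // names already handed out
  int next_id_ = 0;                       // counter shared by all names
};
```

If both the TorchScript path and the bytecode path obtained their names from one such uniquer, they would agree by construction. That is the kind of centralized handling the comment above points to, stated here as an assumption rather than a description of the eventual fix.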

Contributor

@raziel raziel Jun 21, 2021

Thanks.
"For the same original class name but different instances. It is handled in TorchScript serialization, by mangling the names in a TypeNameUniquer. However, it's not mangled when serializing bytecode."
So this is a general issue in TS that potentially affects any custom class?

Contributor Author

> So this is a general issue in TS that potentially affects any custom class?

I think so. Not only for custom classes but for all class types. TS has handled it using type_name_uniquer_ in ScriptModuleSerializer. I think bytecode needs to reflect the same pattern.

Contributor

@kimishpatel kimishpatel left a comment

Wondering if we need to add some unique id to it.

@iseeyuan iseeyuan requested a review from kimishpatel June 21, 2021 18:57
Contributor

@raziel raziel left a comment

Thanks.
From what you said, the remaining issue (at the instance level) should be solved elsewhere in the code, and this fix (at the type/class level) still makes sense.

iseeyuan added a commit that referenced this pull request Jun 21, 2021
iseeyuan added a commit that referenced this pull request Jun 22, 2021
iseeyuan added a commit that referenced this pull request Jun 23, 2021
iseeyuan added a commit that referenced this pull request Jun 24, 2021
iseeyuan added a commit that referenced this pull request Jun 25, 2021

@facebook-github-bot

@iseeyuan merged this pull request in d8c3d55.

asuhan pushed a commit to asuhan/pytorch that referenced this pull request Jun 28, 2021
…nd (pytorch#59921)

Summary: Pull Request resolved: pytorch#59921

Test Plan: Imported from OSS

Reviewed By: raziel

Differential Revision: D29091143

Pulled By: iseeyuan

fbshipit-source-id: 9ffcd18681917ece8ec73a34866c53701bdee1bc
@facebook-github-bot facebook-github-bot deleted the gh/iseeyuan/124/head branch June 29, 2021 14:22
asuhan pushed a commit that referenced this pull request Jun 30, 2021
…nd (#59921)

Summary: Pull Request resolved: #59921

Test Plan: Imported from OSS

Reviewed By: raziel

Differential Revision: D29091143

Pulled By: iseeyuan

fbshipit-source-id: 9ffcd18681917ece8ec73a34866c53701bdee1bc
@tugsbayasgalan tugsbayasgalan changed the title [Delegate] Support composite of lowered sub modules of the same backend Support composite of lowered sub modules of the same backend Oct 18, 2021