Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[jit] Module clone work with shared ClassType #31970

Closed
wants to merge 8 commits into from

Conversation

jerryzh168
Copy link
Contributor

@jerryzh168 jerryzh168 commented Jan 8, 2020

Stack from ghstack:

Summary:
Now that the ClassType can be shared among different module instances, we'll
preserve the sharing in clone as well, that is if the original module has
a ClassType that is shared, we'll clone this ClassType once and share it between
different module instances as well.

Test Plan:
build/test/test_jit

Reviewers:
mvz
Subscribers:

Tasks:

Tags:

Differential Revision: D19406251

Summary:
Now that the ClassType can be shared among different module instances, we'll
preserve the sharing in clone as well, that is if the original module has
a ClassType that is shared, we'll clone this ClassType once and share it between
different module instances as well.

Test Plan:
build/test/test_jit

Reviewers:
mvz
Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
Summary:
Now that the ClassType can be shared among different module instances, we'll
preserve the sharing in clone as well, that is if the original module has
a ClassType that is shared, we'll clone this ClassType once and share it between
different module instances as well.

Test Plan:
build/test/test_jit

Reviewers:
mvz
Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
@kostmo
Copy link
Member

kostmo commented Jan 9, 2020

💊 CircleCI build failures summary and remediations

As of commit 7fdad94:

  • 1/2 failures introduced in this PR
  • 1/2 recognized as flaky ❄️
    • Re-run these jobs?

Detailed failure analysis

One may explore the probable reasons each build failed interactively on the Dr. CI website.

🕵️ 1 new failure recognized by patterns

The following build failures do not appear to be due to upstream breakage:

See CircleCI build pytorch_macos_10_13_py3_test (1/1)

Step: "Test" (full log | pattern match details)

Jan 14 21:10:16 RuntimeError: test_distributed failed!
Jan 14 21:10:15 Ran 92 tests in 336.384s 
Jan 14 21:10:15  
Jan 14 21:10:15 FAILED (failures=1, skipped=19) 
Jan 14 21:10:15  
Jan 14 21:10:15 Generating XML reports... 
Jan 14 21:10:16 Traceback (most recent call last): 
Jan 14 21:10:16   File "test/run_test.py", line 456, in <module> 
Jan 14 21:10:16     main() 
Jan 14 21:10:16   File "test/run_test.py", line 449, in main 
Jan 14 21:10:16     raise RuntimeError(message) 
Jan 14 21:10:16 RuntimeError: test_distributed failed! 
Jan 14 21:10:16 + cleanup 
Jan 14 21:10:16 + retcode=1 
Jan 14 21:10:16 + set +x 

❄️ 1 failure recognized as flaky

The following build failures have been detected as flaky and may not be your fault:

See CircleCI build pytorch_linux_xenial_py3_clang5_mobile_build (1/1)

Step: "Set Up CI Environment After attach_workspace" (full log | pattern match details) ❄️

E: Failed to fetch https://nvidia.github.io/libnvidia-container/ubuntu16.04/amd64/./libnvidia-container1_1.0.5-1_amd64.deb gnutls_handshake() failed: Error in the pull function.
                                                                   92% [Working]               Get:26 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 intel-microcode amd64 3.20191115.1ubuntu0.16.04.2 [2,408 kB] 
                                             95% [Working]               Get:27 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 amd64-microcode amd64 3.20191021.1+really3.20180524.1~ubuntu0.16.04.2 [30.8 kB] 
95% [27 amd64-microcode 30.8 kB/30.8 kB 100%]                      21.4 MB/s 0s 95% [Working]                                                      21.4 MB/s 0s                                                                                 Get:28 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 linux-image-generic amd64 4.4.0.171.179 [2,442 B] 
95% [28 linux-image-generic 2,297 B/2,442 B 94%]                   21.4 MB/s 0s 96% [Working]                                                      21.4 MB/s 0s                                                                                 Get:29 http://archive.ubuntu.com/ubuntu xenial/universe amd64 moreutils amd64 0.57-1 [55.0 kB] 
96% [29 moreutils 2,295 B/55.0 kB 4%]                              21.4 MB/s 0s 96% [Working]                                                      21.4 MB/s 0s                                                                                 Get:30 http://archive.ubuntu.com/ubuntu xenial/main amd64 tcl8.6 amd64 8.6.5+dfsg-2 [14.2 kB] 
96% [30 tcl8.6 2,295 B/14.2 kB 16%]                                21.4 MB/s 0s 97% [Working]                                                      21.4 MB/s 0s                                                                                 Get:31 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 thermald amd64 1.5-2ubuntu4 [187 kB] 
97% [31 thermald 2,293 B/187 kB 1%]                                21.4 MB/s 0s 98% [Working]                                                      21.4 MB/s 0s                                                                                 Get:32 http://archive.ubuntu.com/ubuntu xenial/main amd64 tk8.6 amd64 8.6.5-1 [12.2 kB] 
98% [32 tk8.6 2,295 B/12.2 kB 19%]                                 21.4 MB/s 0s 98% [Working]                                                      21.4 MB/s 0s                                                                                 Get:33 http://archive.ubuntu.com/ubuntu xenial/main amd64 xterm amd64 322-1ubuntu1 [607 kB] 
98% [33 xterm 2,293 B/607 kB 0%]                                   21.4 MB/s 0s 99% [Working]                                                      21.4 MB/s 0s                                                                                 Fetched 131 MB in 7s (18.1 MB/s) 
W: --force-yes is deprecated, use one of the options starting with --allow instead. 
E: Failed to fetch https://nvidia.github.io/libnvidia-container/ubuntu16.04/amd64/./libnvidia-container1_1.0.5-1_amd64.deb  gnutls_handshake() failed: Error in the pull function. 
 
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing? 

This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker.

This comment has been revised 19 times.

@ZolotukhinM
Copy link

We probably need two versions of clone: one that creates a new type and one that uses the old one. E.g. when we clone a module in order to quantize it, we need a new type as we're going to mutate its methods. Inside quantization, however, where we have several stages (insert observers, replace observers with quant-dequant, etc.), when we need a copy of the module we would clone without cloning the type (since we've already created the new type in the very beginning).

@jerryzh168
Copy link
Contributor Author

What is clone without cloning type? are you talking about clone_instance or just don't clone?

@jerryzh168
Copy link
Contributor Author

I think insert_quant_dequant should work without cloning at all, the reason we clone is to invalidate the executor cache, since after insert_observer we'll run the model and the executor is cached.

@ZolotukhinM
Copy link

I think insert_quant_dequant should work without cloning at all, the reason we clone is to invalidate the executor cache, since after insert_observer we'll run the model and the executor is cached.

We would still need to do that, right? If we don't create a new type at that moment, an old graph (with observers rather than with q-dq nodes) will be executed otherwise.

@jerryzh168
Copy link
Contributor Author

I think insert_quant_dequant should work without cloning at all, the reason we clone is to invalidate the executor cache, since after insert_observer we'll run the model and the executor is cached.

We would still need to do that, right? If we don't create a new type at that moment, an old graph (with observers rather than with q-dq nodes) will be executed otherwise.

Yeah, that's what we do right now, we'll clone in both insert_observers and insert_quant_dequant

Summary:
Now that the ClassType can be shared among different module instances, we'll
preserve the sharing in clone as well, that is if the original module has
a ClassType that is shared, we'll clone this ClassType once and share it between
different module instances as well.

Test Plan:
build/test/test_jit

Reviewers:
mvz
Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
Summary:
Now that the ClassType can be shared among different module instances, we'll
preserve the sharing in clone as well, that is if the original module has
a ClassType that is shared, we'll clone this ClassType once and share it between
different module instances as well.

Test Plan:
build/test/test_jit

Reviewers:
mvz
Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
Summary:
Now that the ClassType can be shared among different module instances, we'll
preserve the sharing in clone as well, that is if the original module has
a ClassType that is shared, we'll clone this ClassType once and share it between
different module instances as well.

Test Plan:
build/test/test_jit

Reviewers:
mvz
Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
@jerryzh168
Copy link
Contributor Author

We also discovered that the original clone does not work anymore after the shared ClassType change.
If a module contains two instances that shares the same type, e.g.

class M(torch.nn.Module):
    def __init__(self):
        super(M, self).__init__()
        self.fc1 = torch.nn.Linear(5, 5)
        self.fc2 = torch.nn.Linear(5, 5)
    
    def forward(self, x):
           return self.fc1(self.fc2(x))
a = torch.jit.script(M())
a = a.copy()
print(a._c.dump_to_str())

they will have different type in the module but same type in the forward function:

module __torch__.___torch_mangle_268.M {
  parameters {
  }
  attributes {
    training = True
    fc1 = <__torch__.torch.nn.modules.linear.___torch_mangle_269.Linear object at 0x7fe2a0658870>
    fc2 = <__torch__.torch.nn.modules.linear.___torch_mangle_270.Linear object at 0x7fe2a0658640>
  }
  methods {
    method forward {
      graph(%self : __torch__.___torch_mangle_268.M,
            %x.1 : Tensor):
        %2 : __torch__.torch.nn.modules.linear.___torch_mangle_270.Linear = prim::GetAttr[name="fc1"](%self)
        %3 : __torch__.torch.nn.modules.linear.___torch_mangle_270.Linear = prim::GetAttr[name="fc2"](%self)
        %4 : Tensor = prim::CallMethod[name="forward"](%3, %x.1) # <ipython-input-67-70905ca9c75f>:8:27
        %5 : Tensor = prim::CallMethod[name="forward"](%2, %4) # <ipython-input-67-70905ca9c75f>:8:18
        return (%5)
  
    }
  }

Copy link

@ZolotukhinM ZolotukhinM left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, please find a couple of comments inline!

torch/csrc/jit/script/module.cpp Show resolved Hide resolved
"Temporarily skip the test since we don't have"
"support for different quantization configurations for shared"
"ClassType right now, sub1.fc and fc3 shares the ClassType but for"
" sub1.fc qconfig is None, and fc3 is quantized with default_qconfig")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this test working correctly before? If not, it's better to split this change into a separate commit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, this is working before since we clone will create new type for all types so we don't have type sharing after clone previously.

Summary:
Now that the ClassType can be shared among different module instances, we'll
preserve the sharing in clone as well, that is if the original module has
a ClassType that is shared, we'll clone this ClassType once and share it between
different module instances as well.

Test Plan:
build/test/test_jit

Reviewers:
mvz
Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
Summary:
Now that the ClassType can be shared among different module instances, we'll
preserve the sharing in clone as well, that is if the original module has
a ClassType that is shared, we'll clone this ClassType once and share it between
different module instances as well.

Test Plan:
build/test/test_jit

Reviewers:
mvz
Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
Summary:
Now that the ClassType can be shared among different module instances, we'll
preserve the sharing in clone as well, that is if the original module has
a ClassType that is shared, we'll clone this ClassType once and share it between
different module instances as well.

Test Plan:
build/test/test_jit

Reviewers:
mvz
Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 4314620.

@facebook-github-bot facebook-github-bot deleted the gh/jerryzh168/179/head branch January 19, 2020 15:16
wuhuikx pushed a commit to wuhuikx/pytorch that referenced this pull request Jan 30, 2020
Summary:
Pull Request resolved: pytorch#31970

Now that the ClassType can be shared among different module instances, we'll
preserve the sharing in clone as well, that is if the original module has
a ClassType that is shared, we'll clone this ClassType once and share it between
different module instances as well.

Test Plan:
build/test/test_jit

Imported from OSS

Differential Revision: D19406251

fbshipit-source-id: 2881c695f6e718e5432040a3817cf187a62017bf
ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020
Summary:
Pull Request resolved: pytorch#31970

Now that the ClassType can be shared among different module instances, we'll
preserve the sharing in clone as well, that is if the original module has
a ClassType that is shared, we'll clone this ClassType once and share it between
different module instances as well.

Test Plan:
build/test/test_jit

Imported from OSS

Differential Revision: D19406251

fbshipit-source-id: 2881c695f6e718e5432040a3817cf187a62017bf
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Merged oncall: jit Add this issue/PR to JIT oncall triage queue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants