Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ZeroRedundancyOptimizer] Pytorch compliant state #52960

Closed
wants to merge 1 commit into from

Conversation

blefaudeux
Copy link
Contributor

@blefaudeux blefaudeux commented Feb 27, 2021

Same as #52760 which I could not get to land. I just could not live with ghstack/ghimport/randomly broken things, I break enough of them myself, so this is a fresh copy without ghstack shenanigans. I'm hopeful that this can land relatively bug free, and am sorry for the duplications..

What this does:

  • call the common_utils test runner instead of unittest, because it seems that it's how it should be done
  • change the returned state from ZeroRedundancyOptimizer to be PyTorch compliant, which has the added benefit of being elastic (world size independent)

@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Feb 27, 2021

💊 CI failures summary and remediations

As of commit 98cc2f3 (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_clang7_onnx_ort_test2 (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Feb 27 05:46:44 E AssertionError: False is not true
Feb 27 05:46:44         size = workspace.FetchBlob(bsize_map[blobs[0]])
Feb 27 05:46:44         self.assertEqual(tuple(), size.shape)
Feb 27 05:46:44         self.assertEqual(min(num_to_collect, max_example_to_cover), size.item())
Feb 27 05:46:44     
Feb 27 05:46:44         hist, _ = np.histogram(
Feb 27 05:46:44             reference_result[:, 0], bins=10, range=(1, max_example_to_cover)
Feb 27 05:46:44         )
Feb 27 05:46:44         print("Sample histogram: {}".format(hist))
Feb 27 05:46:44     
Feb 27 05:46:44 >       self.assertTrue(all(hist > 0.6 * (num_to_collect / 10)))
Feb 27 05:46:44 E       AssertionError: False is not true
Feb 27 05:46:44 
Feb 27 05:46:44 /opt/conda/lib/python3.6/site-packages/caffe2/python/operator_test/dataset_ops_test.py:690: AssertionError
Feb 27 05:46:44 ----------------------------- Captured stdout call -----------------------------
Feb 27 05:46:44 Collect Net Proto: name: "collect_net_10"
Feb 27 05:46:44 op {
Feb 27 05:46:44   input: "blob_1_vec"
Feb 27 05:46:44   input: "blob_2_vec"
Feb 27 05:46:44   input: "blob_3_vec"
Feb 27 05:46:44   input: "blob_1"
Feb 27 05:46:44   input: "blob_2"

This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@blefaudeux
Copy link
Contributor Author

Copy link
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already reviewed on #52760.

@blefaudeux would I be correct if I assume the additional lines compared to #52760 are code format changes and the logic is still the same?

@mrshenli
Copy link
Contributor

The broken unit test seems unrelated

the error indeed look irrelevant. I think it should be safe to land.

Error: No such container:path: fa36bb4f7a10af04bbab8e7159921ade2e68270d129326355f9e4132e21b4be9:/var/lib/jenkins/workspace/test/.coverage

@blefaudeux
Copy link
Contributor Author

Already reviewed on #52760.

@blefaudeux would I be correct if I assume the additional lines compared to #52760 are code format changes and the logic is still the same?

Yes, I'm not sure how the sequence unrolled but it's a simple copy of #52760 except for the unit test runner baked in (so basically 52760 and prior)

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@blefaudeux has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@blefaudeux merged this pull request in 249c213.

@blefaudeux blefaudeux deleted the zero_pytorch_state branch February 28, 2021 03:37
aocsa pushed a commit to Quansight/pytorch that referenced this pull request Mar 15, 2021
Summary:
Same as pytorch#52760 which I could not get to land. I just could not live with ghstack/ghimport/randomly broken things, I break enough of them myself, so this is a fresh copy without ghstack shenanigans. I'm hopeful that this can land relatively bug free, and am sorry for the duplications..

What this does:
- call the common_utils test runner instead of unittest, because it seems that it's how it should be done
- change the returned state from ZeroRedundancyOptimizer to be PyTorch compliant, which has the added benefit of being elastic (world size independent)

Pull Request resolved: pytorch#52960

Reviewed By: mrshenli

Differential Revision: D26710932

Pulled By: blefaudeux

fbshipit-source-id: 1d914bc9221442ba1bb2b48f5df10c313e674ece
xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021
Summary:
Same as pytorch#52760 which I could not get to land. I just could not live with ghstack/ghimport/randomly broken things, I break enough of them myself, so this is a fresh copy without ghstack shenanigans. I'm hopeful that this can land relatively bug free, and am sorry for the duplications..

What this does:
- call the common_utils test runner instead of unittest, because it seems that it's how it should be done
- change the returned state from ZeroRedundancyOptimizer to be PyTorch compliant, which has the added benefit of being elastic (world size independent)

Pull Request resolved: pytorch#52960

Reviewed By: mrshenli

Differential Revision: D26710932

Pulled By: blefaudeux

fbshipit-source-id: 1d914bc9221442ba1bb2b48f5df10c313e674ece
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants