Revert D46920584: Multisect successfully blamed D46920584 for test or build failures (#104269) #104302
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/104302

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures as of commit dacfafe.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D46997394
… build failures (pytorch#104302)

Summary:
Pull Request resolved: pytorch#104302
Pull Request resolved: pytorch#104269

This diff is reverting D46920584.

D46920584: Make `torch.empty*` deterministic by filling with NaN or max int value (pytorch#101849) by generatedunixname499836121 has been identified to be causing the following test or build failures:

Tests affected:
- [torchrec/distributed/composable/tests:test_fsdp - torchrec.distributed.composable.tests.test_fsdp.FullyShardTest: test_composable_checkpoint](https://www.internalfb.com/intern/test/281475062923125/)

Here's the Multisect link: https://www.internalfb.com/multisect/2341386

Here are the tasks that are relevant to this breakage:

We're generating a revert to back out the changes in this diff. Please note the backout may land if someone accepts it. If you believe this diff has been generated in error, you may Commandeer and Abandon it.

Test Plan: NA

Reviewed By: huydhn, osalpekar

Differential Revision: D46997394

fbshipit-source-id: 80a48cf176dd87f26ddb431634df4f713a70648b
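For context, the reverted diff made `torch.empty*` fill floating-point tensors with NaN and integer tensors with the dtype's maximum value when deterministic mode is enabled, so that reads of "uninitialized" memory produce loud, reproducible values. The following is a pure-Python sketch of that fill policy, not PyTorch's actual implementation; `deterministic_empty` and `INT32_MAX` are made-up names for illustration:

```python
import math

# Illustrative model of the reverted behavior: uninitialized storage is
# filled with NaN (floating point) or the dtype's max value (integer).
INT32_MAX = 2**31 - 1

def deterministic_empty(n, dtype="float"):
    """Return a list standing in for an 'uninitialized' buffer of length n."""
    if dtype == "float":
        return [math.nan] * n
    if dtype == "int32":
        return [INT32_MAX] * n
    raise ValueError(f"unsupported dtype: {dtype}")

buf_f = deterministic_empty(4, "float")
buf_i = deterministic_empty(4, "int32")
print(all(math.isnan(x) for x in buf_f))  # True
print(buf_i[0])                           # 2147483647
```

The point of the feature is that code which silently depended on whatever garbage happened to be in uninitialized memory now fails visibly and deterministically, which is exactly what tripped the torchrec test below.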
Force-pushed from 54eae6f to dacfafe (Compare)
@pytorchbot merge -f 'Landed internally'

(Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 mandatory check(s) are pending/not yet run. The first few are:

Dig deeper by viewing the pending checks on hud.
/easycla

@pytorchbot merge -f "jedi-landed internally"

Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Hey @amrshennawi or @osalpekar, can you please let me know what I can do to attempt to fix this? Is there a minimal reproducer?
Hey @kurtamohler. We ran a bisect and it looks like this broke the torchrec `test_composable_checkpoint` test.
@osalpekar, thanks. Do you have any tips on how to get the failing test running locally?

I tried using both pytorch-nightly and a locally built pytorch checkout. I also tried checking out the most recent release branch of torchrec. In all cases, I get the same error.

EDIT: I found that the install instructions for fbgemm mention this error: https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/docs/InstallationInstructions.md#undefined-symbols I still haven't gotten it to work yet, but I must be on the right track.
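As an aside for anyone hitting the same undefined-symbol error: one quick way to make sense of it is to demangle the symbol with `c++filt` (from GNU binutils) and then check whether the libtorch you are actually linking against exports it with `nm`. The library path below is an example, not taken from this thread; adjust it to your environment.

```shell
# Demangle the unresolved symbol to see which C++ function is missing.
c++filt _ZNK5torch8autograd4Node4nameEv
# → torch::autograd::Node::name() const

# Then check whether your installed libtorch exports that symbol
# (example path — substitute the one from your environment):
# nm -D /path/to/libtorch_cpu.so | c++filt | grep "torch::autograd::Node::name"
```

If the symbol is absent from the library that `fbgemm_gpu_py.so` resolves against, the extension was built against a different (usually newer or ABI-incompatible) libtorch than the one being loaded at runtime.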
@kurtamohler Here is a link to the scripts used to build torchrec and run the unittests in our CI pipeline: https://github.com/pytorch/torchrec/blob/main/.github/workflows/unittest_ci_cpu.yml#L41-L70. cc @YLGH might be able to help.
Thanks @osalpekar! I'm able to get torchrec running with that script. But I'm still having trouble getting my local pytorch build working with it. I tried building pytorch with

The problem seems to show up when importing `fbgemm_gpu`:

```
>>> import torch
>>> import fbgemm_gpu
/home/kurtamohler/.conda/envs/pytorch-0-copy-2/lib/python3.9/site-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv
Traceback (most recent call last):
  File "/home/kurtamohler/develop/pytorch-0/torch/_ops.py", line 566, in __getattr__
    op, overload_names = torch._C._jit_get_operation(qualified_op_name)
RuntimeError: No such operator fbgemm::jagged_2d_to_dense

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/kurtamohler/.conda/envs/pytorch-0-copy-2/lib/python3.9/site-packages/fbgemm_gpu/__init__.py", line 21, in <module>
    from . import _fbgemm_gpu_docs  # noqa: F401, E402
  File "/home/kurtamohler/.conda/envs/pytorch-0-copy-2/lib/python3.9/site-packages/fbgemm_gpu/_fbgemm_gpu_docs.py", line 19, in <module>
    torch.ops.fbgemm.jagged_2d_to_dense,
  File "/home/kurtamohler/develop/pytorch-0/torch/_ops.py", line 570, in __getattr__
    raise AttributeError(
AttributeError: '_OpNamespace' 'fbgemm' object has no attribute 'jagged_2d_to_dense'
```
I guess there must be some difference in how I'm building pytorch compared to how pytorch-nightly gets built. I have the following pytorch build flags set:

cc @YLGH

EDIT: Nevermind, I was finally able to get torchrec built and running properly with my local pytorch main checkout. Now I just have to cherry-pick my deterministic `torch.empty*` change.
I've reproduced the error. I figured I would document what I did, just in case it's needed in the future.

- Check out torchrec commit ba943982df9a1498cc09bf5484928dd47e47efcf
- Check out pytorch commit d3ba890
- Go to
- Go to
- Go to
- Check that you can do this:
- Go to
- Check that you can import it:
- Run the failing test:
The problem is that

In the case of the failing fsdp test, I found that replacing the

So @osalpekar, how should we coordinate this change?
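The torchrec-side details were not preserved in this thread, but a common way a NaN-fill change like this one breaks downstream tests is worth sketching: NaN compares unequal to everything, including itself, so any equality-based check that touches memory which used to hold arbitrary-but-equal garbage starts failing once that memory is deterministically NaN. The function names below are hypothetical, purely for illustration:

```python
import math

# Hypothetical illustration (not the actual torchrec test): an
# element-wise equality check works on ordinary values but fails on
# NaN-filled buffers, because NaN != NaN under IEEE 754 rules.
def buffers_equal(a, b):
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))

plain = [0.0, 1.0, 2.0]
print(buffers_equal(plain, plain))  # True

nan_filled = [math.nan, math.nan, math.nan]
print(buffers_equal(nan_filled, nan_filled))  # False: NaN != NaN

# A NaN-tolerant comparison handles both cases:
def buffers_equal_nan(a, b):
    return len(a) == len(b) and all(
        x == y or (math.isnan(x) and math.isnan(y)) for x, y in zip(a, b)
    )

print(buffers_equal_nan(nan_filled, nan_filled))  # True
```

Fixing the consumer (the test) rather than reverting the producer (the deterministic fill) is what the torchrec coordination below ended up doing.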
I've submitted the above issue in torchrec to coordinate the change.

Thanks @kurtamohler and @YLGH for coordinating the fixes. Once the change is merged in torchrec, feel free to re-open and merge the original PyTorch PR!
…int (#104995)

Relands #101849 after #104302 reverted it. torchrec PR meta-pytorch/torchrec#1269 fixes the torchrec failure that caused #101849 to be reverted.

Part of #82004

Pull Request resolved: #104995
Approved by: https://github.com/albanD