
Conversation

amrshennawi
Contributor

Summary:

This diff reverts D46920584.
D46920584 (Make torch.empty* deterministic by filling with NaN or max int value, #101849, by generatedunixname499836121) has been identified as causing the following test or build failures:

Tests affected:

Here's the Multisect link:
https://www.internalfb.com/multisect/2341386
Here are the tasks that are relevant to this breakage:

We're generating a revert to back out the changes in this diff; please note that the backout may land if someone accepts it.

If you believe this diff has been generated in error, you may Commandeer and Abandon it.

Test Plan: NA

Reviewed By: huydhn, osalpekar

Differential Revision: D46997394
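
For context, here is a minimal sketch of the behavior the reverted diff introduced, assuming deterministic mode is enabled; the output described in the comments is illustrative of the change in #101849, not a description of the current API:

import torch

# With the reverted change in place, enabling deterministic mode makes torch.empty*
# fill new tensors instead of leaving them uninitialized.
torch.use_deterministic_algorithms(True)

f = torch.empty(3)                     # float tensors: filled with NaN under that change
i = torch.empty(3, dtype=torch.int64)  # integer tensors: filled with the dtype's max value
print(f)
print(i)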

@amrshennawi requested a review from kulinseth as a code owner on June 27, 2023 at 22:03.
@pytorch-bot

pytorch-bot bot commented Jun 27, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/104302

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit dacfafe:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot added the ciflow/mps (Run MPS tests, subset of trunk) and release notes: mps (Release notes category) labels on Jun 27, 2023.
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D46997394

… build failures (pytorch#104302)

Summary:
Pull Request resolved: pytorch#104302

Pull Request resolved: pytorch#104269

This diff reverts D46920584.
D46920584 (Make `torch.empty*` deterministic by filling with NaN or max int value, pytorch#101849, by generatedunixname499836121) has been identified as causing the following test or build failures:

Tests affected:
- [torchrec/distributed/composable/tests:test_fsdp - torchrec.distributed.composable.tests.test_fsdp.FullyShardTest: test_composable_checkpoint](https://www.internalfb.com/intern/test/281475062923125/)

Here's the Multisect link:
https://www.internalfb.com/multisect/2341386
Here are the tasks that are relevant to this breakage:

We're generating a revert to back out the changes in this diff; please note that the backout may land if someone accepts it.

If you believe this diff has been generated in error, you may Commandeer and Abandon it.

Test Plan: NA

Reviewed By: huydhn, osalpekar

Differential Revision: D46997394

fbshipit-source-id: 80a48cf176dd87f26ddb431634df4f713a70648b

@facebook-github-bot
Contributor

@pytorchbot merge -f 'Landed internally'

(Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) are pending/not yet run. The first few are:

  • EasyCLA

Dig deeper by viewing the pending checks on hud

Details for Dev Infra team (raised by workflow job)

Failing merge rule: Core Maintainers

@osalpekar
Member

/easycla

@osalpekar
Member

@pytorchbot merge -f "jedi-landed internally"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@kurtamohler
Collaborator

Hey @amrshennawi or @osalpekar, can you please let me know what I can do to attempt to fix this? Is there a minimal reproducer?

@osalpekar
Member

Hey @kurtamohler. We ran a bisect and it looks like this broke test_fsdp in torchrec:

Traceback (most recent call last):
  File "/usr/local/.../lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/.../lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File ".../torchrec/distributed/composable/tests/test_fsdp.py", line 229, in _run
    assert p_sum.allclose(p_sum_loaded)
AssertionError

@kurtamohler
Collaborator

kurtamohler commented Jun 29, 2023

@osalpekar, thanks. Do you have any tips on how to get torchrec to build and run properly from source, or could you point me to someone who might? I followed the instructions in the README and got this error when trying to run the tests:

$ torchx run -s local_cwd dist.ddp -j 1x2 --gpu 2 --script test_installation.py

torchx 2023-06-29 18:10:54 INFO     Tracker configurations: {}
torchx 2023-06-29 18:10:54 INFO     Log directory not set in scheduler cfg. Creating a temporary log dir that will be deleted on exit. To preserve log directory set the `log_dir` cfg option
torchx 2023-06-29 18:10:54 INFO     Log directory is: /home/kurtamohler/testtmp/torchx_g1cq5bga
local_cwd://torchx/test_installation-vth3g90w06cltd
torchx 2023-06-29 18:10:54 INFO     Waiting for the app to finish...
test_installation/0 [2023-06-29 18:10:56,675] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
test_installation/0 [2023-06-29 18:10:56,676] torch.distributed.run: [WARNING] 
test_installation/0 *****************************************
test_installation/0 Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
test_installation/0 *****************************************
test_installation/0 [0]:/home/kurtamohler/.conda/envs/pytorch-0/lib/python3.9/site-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv
test_installation/0 [1]:/home/kurtamohler/.conda/envs/pytorch-0/lib/python3.9/site-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv
test_installation/0 [0]:Traceback (most recent call last):
test_installation/0 [0]:  File "/work2/kurtamohler/development/pytorch-0/torch/_ops.py", line 742, in __getattr__
test_installation/0 [0]:    op, overload_names = torch._C._jit_get_operation(qualified_op_name)
test_installation/0 [0]:RuntimeError: No such operator fbgemm::jagged_2d_to_dense
test_installation/0 [0]:
test_installation/0 [0]:The above exception was the direct cause of the following exception:
test_installation/0 [0]:
test_installation/0 [0]:Traceback (most recent call last):
test_installation/0 [0]:  File "/work2/kurtamohler/development/torchrec/test_installation.py", line 16, in <module>
test_installation/0 [0]:    from torchrec import EmbeddingBagCollection, KeyedJaggedTensor
test_installation/0 [0]:  File "/work2/kurtamohler/development/torchrec/torchrec/__init__.py", line 8, in <module>
test_installation/0 [0]:    import torchrec.distributed  # noqa
test_installation/0 [0]:  File "/work2/kurtamohler/development/torchrec/torchrec/distributed/__init__.py", line 36, in <module>
test_installation/0 [0]:    from torchrec.distributed.model_parallel import DistributedModelParallel  # noqa
test_installation/0 [0]:  File "/work2/kurtamohler/development/torchrec/torchrec/distributed/model_parallel.py", line 19, in <module>
test_installation/0 [0]:    from torchrec.distributed.planner import (
test_installation/0 [0]:  File "/work2/kurtamohler/development/torchrec/torchrec/distributed/planner/__init__.py", line 22, in <module>
test_installation/0 [0]:    from torchrec.distributed.planner.planners import EmbeddingShardingPlanner  # noqa
test_installation/0 [0]:  File "/work2/kurtamohler/development/torchrec/torchrec/distributed/planner/planners.py", line 19, in <module>
test_installation/0 [0]:    from torchrec.distributed.planner.constants import BATCH_SIZE, MAX_SIZE
test_installation/0 [0]:  File "/work2/kurtamohler/development/torchrec/torchrec/distributed/planner/constants.py", line 10, in <module>
test_installation/0 [0]:    from torchrec.distributed.embedding_types import EmbeddingComputeKernel
test_installation/0 [0]:  File "/work2/kurtamohler/development/torchrec/torchrec/distributed/embedding_types.py", line 14, in <module>
test_installation/0 [0]:    from fbgemm_gpu.split_table_batched_embeddings_ops import EmbeddingLocation
test_installation/0 [0]:  File "/home/kurtamohler/.conda/envs/pytorch-0/lib/python3.9/site-packages/fbgemm_gpu/__init__.py", line 21, in <module>
test_installation/0 [0]:    from . import _fbgemm_gpu_docs  # noqa: F401, E402
test_installation/0 [0]:  File "/home/kurtamohler/.conda/envs/pytorch-0/lib/python3.9/site-packages/fbgemm_gpu/_fbgemm_gpu_docs.py", line 19, in <module>
test_installation/0 [0]:    torch.ops.fbgemm.jagged_2d_to_dense,
test_installation/0 [0]:  File "/work2/kurtamohler/development/pytorch-0/torch/_ops.py", line 746, in __getattr__
test_installation/0 [0]:    raise AttributeError(
test_installation/0 [0]:AttributeError: '_OpNamespace' 'fbgemm' object has no attribute 'jagged_2d_to_dense'
test_installation/0 [1]:Traceback (most recent call last):
test_installation/0 [1]:  File "/work2/kurtamohler/development/pytorch-0/torch/_ops.py", line 742, in __getattr__
test_installation/0 [1]:    op, overload_names = torch._C._jit_get_operation(qualified_op_name)
test_installation/0 [1]:RuntimeError: No such operator fbgemm::jagged_2d_to_dense
test_installation/0 [1]:
test_installation/0 [1]:The above exception was the direct cause of the following exception:
test_installation/0 [1]:
test_installation/0 [1]:Traceback (most recent call last):
test_installation/0 [1]:  File "/work2/kurtamohler/development/torchrec/test_installation.py", line 16, in <module>
test_installation/0 [1]:    from torchrec import EmbeddingBagCollection, KeyedJaggedTensor
test_installation/0 [1]:  File "/work2/kurtamohler/development/torchrec/torchrec/__init__.py", line 8, in <module>
test_installation/0 [1]:    import torchrec.distributed  # noqa
test_installation/0 [1]:  File "/work2/kurtamohler/development/torchrec/torchrec/distributed/__init__.py", line 36, in <module>
test_installation/0 [1]:    from torchrec.distributed.model_parallel import DistributedModelParallel  # noqa
test_installation/0 [1]:  File "/work2/kurtamohler/development/torchrec/torchrec/distributed/model_parallel.py", line 19, in <module>
test_installation/0 [1]:    from torchrec.distributed.planner import (
test_installation/0 [1]:  File "/work2/kurtamohler/development/torchrec/torchrec/distributed/planner/__init__.py", line 22, in <module>
test_installation/0 [1]:    from torchrec.distributed.planner.planners import EmbeddingShardingPlanner  # noqa
test_installation/0 [1]:  File "/work2/kurtamohler/development/torchrec/torchrec/distributed/planner/planners.py", line 19, in <module>
test_installation/0 [1]:    from torchrec.distributed.planner.constants import BATCH_SIZE, MAX_SIZE
test_installation/0 [1]:  File "/work2/kurtamohler/development/torchrec/torchrec/distributed/planner/constants.py", line 10, in <module>
test_installation/0 [1]:    from torchrec.distributed.embedding_types import EmbeddingComputeKernel
test_installation/0 [1]:  File "/work2/kurtamohler/development/torchrec/torchrec/distributed/embedding_types.py", line 14, in <module>
test_installation/0 [1]:    from fbgemm_gpu.split_table_batched_embeddings_ops import EmbeddingLocation
test_installation/0 [1]:  File "/home/kurtamohler/.conda/envs/pytorch-0/lib/python3.9/site-packages/fbgemm_gpu/__init__.py", line 21, in <module>
test_installation/0 [1]:    from . import _fbgemm_gpu_docs  # noqa: F401, E402
test_installation/0 [1]:  File "/home/kurtamohler/.conda/envs/pytorch-0/lib/python3.9/site-packages/fbgemm_gpu/_fbgemm_gpu_docs.py", line 19, in <module>
test_installation/0 [1]:    torch.ops.fbgemm.jagged_2d_to_dense,
test_installation/0 [1]:  File "/work2/kurtamohler/development/pytorch-0/torch/_ops.py", line 746, in __getattr__
test_installation/0 [1]:    raise AttributeError(
test_installation/0 [1]:AttributeError: '_OpNamespace' 'fbgemm' object has no attribute 'jagged_2d_to_dense'
test_installation/0 [2023-06-29 18:11:02,145] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1046) of binary: /home/kurtamohler/.conda/envs/pytorch-0/bin/python
test_installation/0 Traceback (most recent call last):
test_installation/0   File "/home/kurtamohler/.conda/envs/pytorch-0/bin/torchrun", line 33, in <module>
test_installation/0     sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
test_installation/0   File "/work2/kurtamohler/development/pytorch-0/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
test_installation/0     return f(*args, **kwargs)
test_installation/0   File "/work2/kurtamohler/development/pytorch-0/torch/distributed/run.py", line 797, in main
test_installation/0     run(args)
test_installation/0   File "/work2/kurtamohler/development/pytorch-0/torch/distributed/run.py", line 788, in run
test_installation/0     elastic_launch(
test_installation/0   File "/work2/kurtamohler/development/pytorch-0/torch/distributed/launcher/api.py", line 134, in __call__
test_installation/0     return launch_agent(self._config, self._entrypoint, list(args))
test_installation/0   File "/work2/kurtamohler/development/pytorch-0/torch/distributed/launcher/api.py", line 264, in launch_agent
test_installation/0     raise ChildFailedError(
test_installation/0 torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
test_installation/0 ============================================================
test_installation/0 test_installation.py FAILED
test_installation/0 ------------------------------------------------------------
test_installation/0 Failures:
test_installation/0 [1]:
test_installation/0   time      : 2023-06-29_18:11:02
test_installation/0   host      : qgpu2.lan
test_installation/0   rank      : 1 (local_rank: 1)
test_installation/0   exitcode  : 1 (pid: 1047)
test_installation/0   error_file: <N/A>
test_installation/0   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
test_installation/0 ------------------------------------------------------------
test_installation/0 Root Cause (first observed failure):
test_installation/0 [0]:
test_installation/0   time      : 2023-06-29_18:11:02
test_installation/0   host      : qgpu2.lan
test_installation/0   rank      : 0 (local_rank: 0)
test_installation/0   exitcode  : 1 (pid: 1046)
test_installation/0   error_file: <N/A>
test_installation/0   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
test_installation/0 ============================================================
torchx 2023-06-29 18:11:02 INFO     Job finished: FAILED
torchx 2023-06-29 18:11:02 ERROR    AppStatus:
  msg: <NONE>
  num_restarts: 0
  roles: []
  state: FAILED (5)
  structured_error_msg: <NONE>
  ui_url: file:///home/kurtamohler/testtmp/torchx_g1cq5bga/torchx/test_installation-vth3g90w06cltd

I tried using both pytorch-nightly and a locally built pytorch checkout. I also tried checking out the most recent release branch of torchrec. In all cases, I got the same error.

EDIT: I found that the install instructions for fbgemm mention this error: https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/docs/InstallationInstructions.md#undefined-symbols

I still haven't gotten it to work, but I must be on the right track.

@osalpekar
Member

@kurtamohler Here is a link to the scripts used to build torchrec and run the unit tests in our CI pipeline: https://github.com/pytorch/torchrec/blob/main/.github/workflows/unittest_ci_cpu.yml#L41-L70. cc @YLGH, who might be able to help.

@kurtamohler
Collaborator

kurtamohler commented Jul 5, 2023

Thanks @osalpekar! I was able to get torchrec running with that script, but I'm still having trouble getting my local pytorch build working with it. I tried building pytorch with USE_CUDA=0, since the CI workflow uses pytorch-nightly with CPU only, but that didn't work.

The problem seems to occur when fbgemm_gpu is imported after torch:

>>> import torch
>>> import fbgemm_gpu
/home/kurtamohler/.conda/envs/pytorch-0-copy-2/lib/python3.9/site-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv
Traceback (most recent call last):
  File "/home/kurtamohler/develop/pytorch-0/torch/_ops.py", line 566, in __getattr__
    op, overload_names = torch._C._jit_get_operation(qualified_op_name)
RuntimeError: No such operator fbgemm::jagged_2d_to_dense

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/kurtamohler/.conda/envs/pytorch-0-copy-2/lib/python3.9/site-packages/fbgemm_gpu/__init__.py", line 21, in <module>
    from . import _fbgemm_gpu_docs  # noqa: F401, E402
  File "/home/kurtamohler/.conda/envs/pytorch-0-copy-2/lib/python3.9/site-packages/fbgemm_gpu/_fbgemm_gpu_docs.py", line 19, in <module>
    torch.ops.fbgemm.jagged_2d_to_dense,
  File "/home/kurtamohler/develop/pytorch-0/torch/_ops.py", line 570, in __getattr__
    raise AttributeError(
AttributeError: '_OpNamespace' 'fbgemm' object has no attribute 'jagged_2d_to_dense'

torch._C._jit_get_operation doesn't recognize the fbgemm operations.
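
A quick diagnostic sketch, assuming fbgemm_gpu is installed in the active environment; it uses the same private call that appears in the traceback above, so it only reports whether the custom ops got registered:

import torch

try:
    # Importing the extension is what registers the fbgemm ops with the dispatcher.
    import fbgemm_gpu  # noqa: F401
    print("fbgemm_gpu loaded")
except Exception as exc:
    print(f"fbgemm_gpu failed to load: {exc}")

try:
    # Same lookup the traceback shows failing inside torch/_ops.py.
    torch._C._jit_get_operation("fbgemm::jagged_2d_to_dense")
    print("fbgemm::jagged_2d_to_dense is registered")
except RuntimeError as exc:
    print(f"operator not registered: {exc}")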

I guess there must be some difference in how I'm building pytorch compared to how pytorch-nightly gets built. I have the following pytorch build flags set:

USE_MKLDNN=0
USE_XNNPACK=1
USE_FBGEMM=1
USE_CUDA=0
USE_NCCL=1

cc @YLGH

EDIT: Never mind, I was finally able to get torchrec built and running properly with my local pytorch main checkout. Now I just need to cherry-pick my deterministic torch.empty commit, rebuild, and then I should be able to reproduce the failure.

@kurtamohler
Collaborator

I've reproduced the error. I figured I would document what I did just in case it's needed in the future.


Check out torchrec commit ba943982df9a1498cc09bf5484928dd47e47efcf.

Check out pytorch commit d3ba890, then cherry-pick the torch.empty deterministic commit 2642f31.

Go to pytorch/third_party/fbgemm:

mkdir build
cd build
cmake -DUSE_SANITIZER=address -DFBGEMM_LIBRARY_TYPE=shared -DPYTHON_EXECUTABLE=/usr/bin/python3 ..
make -j VERBOSE=1

Go to pytorch/third_party/fbgemm/fbgemm_gpu:

source ../.github/scripts/fbgemm_gpu_build.bash
prepare_fbgemm_gpu_build [env-name]
build_fbgemm_gpu_package [env-name] nightly cuda
pip install -e .

Go to pytorch/:

USE_FBGEMM=1 USE_CUDA=1 python setup.py develop

Check that you can do this:

python -c 'import torch; import fbgemm_gpu'

Go to torchrec/:

sed -i 's/fbgemm-gpu-nightly.*$//g' requirements.txt
pip install -r requirements.txt
conda install -y pytest
python setup.py bdist_wheel --package_name torchrec-nightly --python-tag=py39

Check that you can import it:

python -c "import torchrec"

Run the failing test:

python -m pytest torchrec -v -s -W ignore::pytest.PytestCollectionWarning --continue-on-collection-errors -k fsdp

@kurtamohler
Collaborator

kurtamohler commented Jul 6, 2023

The problem is that torchrec is using the results of torch.empty as inputs to operations in the failing test, and then the output is being used for comparisons. Since deterministic torch.empty fills the result with NaN, the final output is also NaN. Any comparison with a NaN fails.

In the case of the failing fsdp test, I found that replacing the empty call on this line with randn makes the test pass: https://github.com/pytorch/torchrec/blob/f11b8ecc3a2f58929b3957494dccbd465267c7f2/torchrec/distributed/test_utils/test_model.py#L282

So torchrec will have to be updated to avoid using empty tensors as inputs to operations, at least in this one case.
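
Here is a minimal standalone sketch of the failure mode, assuming deterministic mode with the NaN-fill behavior; the shapes and variable names are hypothetical stand-ins, not the torchrec test's:

import torch

# Simulate a parameter that came from torch.empty under the deterministic NaN fill.
p = torch.full((4,), float("nan"))

p_sum = p.sum()                  # NaN propagates through the reduction
p_sum_loaded = p.clone().sum()   # stand-in for the value reloaded from a checkpoint

print(p_sum.allclose(p_sum_loaded))       # False: any comparison involving NaN fails

# Replacing empty with randn (as in the fsdp test fix) gives defined values,
# so the round-trip comparison can succeed.
q = torch.randn(4)
print(q.sum().allclose(q.clone().sum()))  # True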

@osalpekar, how should we coordinate this change?

@kurtamohler
Collaborator

kurtamohler commented Jul 11, 2023

I've submitted an issue in torchrec to coordinate the change.

@osalpekar
Member

Thanks @kurtamohler and @YLGH for coordinating the fixes. Once the change is merged in torchrec, feel free to re-open and merge the original PyTorch PR!

pytorchmergebot pushed a commit that referenced this pull request Jul 13, 2023
…int (#104995)

Relands #101849 after #104302 reverted it.

torchrec PR meta-pytorch/torchrec#1269 fixes the torchrec failure that caused #101849 to be reverted.

Part of #82004

Pull Request resolved: #104995
Approved by: https://github.com/albanD