DISABLED test_2d_fsdp_tp_ac_compile (__main__.TestDTensorCompileE2E) #113781

Closed
clee2000 opened this issue Nov 15, 2023 · 4 comments
Labels
  • oncall: distributed (Add this issue/PR to distributed oncall triage queue)
  • oncall: pt2
  • skipped (Denotes a (flaky) test currently skipped in CI.)
  • triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@clee2000
Contributor

clee2000 commented Nov 15, 2023

Platforms: linux

Broken on multigpu

To re-enable the test on your PR, put `Fixes #<this issue number>` in the PR body (see the example below) and add the `ciflow/periodic` tag to trigger the multigpu jobs.
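
For example, the PR body line for this issue (a sketch, using the issue number from the title) would be:

```
Fixes #113781
```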

Probably caused by #113547 or something in its stack. @wanchaol, do you mind providing a forward fix?

First known bad: https://hud.pytorch.org/pytorch/pytorch/commit/93372455a73043332c16a71cb9dccdf3e0412a57
Last known good: https://hud.pytorch.org/pytorch/pytorch/commit/a1e3c501652101e8b37baac62216db7ca22c9923

Example failure: https://github.com/pytorch/pytorch/actions/runs/6863856295/job/18665805628

_______________ TestDTensorCompileE2E.test_2d_fsdp_tp_ac_compile _______________
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 542, in wrapper
    self._join_processes(fn)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 761, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 811, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 2 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
    getattr(self, test_name)()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
    fn()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2575, in wrapper
    method(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 193, in wrapper
    func(self, *args, **kwargs)  # type: ignore[misc]
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
    return func(*args, **kwargs)
  File "/var/lib/jenkins/workspace/test/distributed/_tensor/test_dtensor_compile.py", line 328, in test_2d_fsdp_tp_ac_compile
    compiled_output = compiled_2d(inp)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 408, in _fn
    return fn(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 840, in forward
    args, kwargs = _pre_forward(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 412, in _pre_forward
    unshard_fn(state, handle)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 447, in _pre_forward_unshard
    _unshard(state, handle, state._unshard_stream, state._pre_unshard_stream)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 331, in _unshard
    handle.unshard()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 1272, in unshard
    self._use_unsharded_flat_param(padded_unsharded_flat_param)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 1404, in _use_unsharded_flat_param
    self._use_unsharded_views(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 1847, in _use_unsharded_views
    views = self._get_unflat_views()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 1824, in _get_unflat_views_aligned
    _ext_post_unflatten_transform(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_fsdp_extensions.py", line 113, in _ext_post_unflatten_transform
    return fsdp_extension.post_unflatten_transform(tensor, param_extension)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/fsdp.py", line 334, in post_unflatten_transform
    result = _unflatten_tensor(tensor, param_extension)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 569, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 671, in _convert_frame
    result = inner_convert(frame, cache_entry, hooks, frame_state)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 377, in _convert_frame_assert
    return _compile(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 614, in _compile
    raise InternalTorchDynamoError(str(e)).with_traceback(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 595, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 243, in time_wrapper
    r = func(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 512, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
    transformations(instructions, code_options)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 150, in _fn
    return fn(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 477, in transform
    tracer.run()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2120, in run
    super().run()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 815, in run
    and self.step()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 778, in step
    getattr(self, inst.opname)(inst)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 469, in wrapper
    return inner_fn(self, inst)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1259, in CALL_FUNCTION_KW
    self.call_function(fn, args, kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 650, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/torch.py", line 572, in call_function
    kwargs_as_value = {k: v.as_python_constant() for k, v in kwargs.items()}
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/torch.py", line 572, in <dictcomp>
    kwargs_as_value = {k: v.as_python_constant() for k, v in kwargs.items()}
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/lists.py", line 66, in as_python_constant
    return self.python_type()([x.as_python_constant() for x in self.items])
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/lists.py", line 66, in <listcomp>
    return self.python_type()([x.as_python_constant() for x in self.items])
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/base.py", line 238, in as_python_constant
    raise NotImplementedError(f"{self} is not a constant")
torch._dynamo.exc.InternalTorchDynamoError: SymNodeVariable() is not a constant

from user code:
   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/_data_parallel_utils.py", line 18, in _unflatten_tensor
    result = DistributedTensor.from_local(

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True


To execute this test, run the following from the base repo dir:
     python test/distributed/_tensor/test_dtensor_compile.py -k test_2d_fsdp_tp_ac_compile

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0


----------------------------- Captured stdout call -----------------------------
Process 2 terminated with exit code 10, terminating remaining processes.
------------------------------ Captured log call -------------------------------
INFO     numba.cuda.cudadrv.driver:driver.py:245 init
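
For context, the `SymNodeVariable() is not a constant` error comes from Dynamo's constant-folding path visible in the traceback (`as_python_constant` in `variables/base.py`): the `DistributedTensor.from_local` call in the user-code frame is traced with keyword arguments containing symbolic shapes/strides, which cannot be folded into Python constants. A simplified sketch of that pattern (class names taken from the traceback for illustration; not Dynamo's actual implementation):

```python
class VariableTracker:
    """Simplified stand-in for torch._dynamo.variables.base.VariableTracker."""

    def as_python_constant(self):
        # Mirrors the raise in the traceback above: a traced value that cannot
        # be folded into a concrete Python constant fails here.
        raise NotImplementedError(f"{self} is not a constant")


class ConstantVariable(VariableTracker):
    """A value whose concrete Python value is known at trace time."""

    def __init__(self, value):
        self.value = value

    def as_python_constant(self):
        return self.value


class SymNodeVariable(VariableTracker):
    """A symbolic value (e.g., a SymInt shape or stride) with no concrete
    value at trace time. It inherits the base-class raise, which is why
    folding the from_local kwargs to constants fails in the log above."""
```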

This test was disabled because it is failing on the main branch (recent examples).

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @ezyang @msaroufim @wconstab @bdhirsh @anijain2305 @zou3519

@pytorch-bot bot added the skipped label (Denotes a (flaky) test currently skipped in CI.) Nov 15, 2023

pytorch-bot bot commented Nov 15, 2023

Hello there! From the DISABLED prefix in this issue title, it looks like you are attempting to disable a test in PyTorch CI. The information I have parsed is below:
  • Test name: test_2d_fsdp_tp_ac_compile (__main__.TestDTensorCompileE2E)
  • Platforms for which to skip the test: linux
  • Disabled by clee2000

Within ~15 minutes, test_2d_fsdp_tp_ac_compile (__main__.TestDTensorCompileE2E) will be disabled in PyTorch CI for these platforms: linux. Please verify that your test name looks correct, e.g., test_cuda_assert_async (__main__.TestCuda).

To modify the platforms list, please include a line in the issue body, like below. The default action will disable the test for all platforms if no platforms list is specified.

Platforms: case-insensitive, list, of, platforms

We currently support the following platforms: asan, dynamo, inductor, linux, mac, macos, rocm, slow, win, windows.
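
For example, a hypothetical issue body line that limits the skip to two of those platforms:

```
Platforms: linux, rocm
```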

@huydhn
Contributor

huydhn commented Nov 21, 2023

Reopening this issue because the test started failing on multigpu after https://hud.pytorch.org/pytorch/pytorch/commit/3e49621f3b4652b8e7782aa8dafb28f9d985598b. The error looks relevant (`unhashable type: non-singleton SymInt`), so I think a forward fix is needed.

cc @awgu

@huydhn reopened this Nov 21, 2023
@bdhirsh added the oncall: distributed and oncall: pt2 labels Nov 21, 2023
@awgu
Contributor

awgu commented Nov 22, 2023

@huydhn I am working on a fix!

pytorchmergebot pushed a commit that referenced this issue Nov 22, 2023
This is a forward fix for #113781.

We lazily compute the hash so that we do not try to compute the hash on `SymInt`s (for the stride) during Dynamo tracing.

Tested via:
```
python test/distributed/_tensor/test_dtensor_compile.py -k test_2d_fsdp_tp_ac_compile
```
Pull Request resolved: #114322
Approved by: https://github.com/wanchaol
ghstack dependencies: #113919, #113924, #114134, #113925, #113930, #114141, #113915, #114140
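
For context, a minimal sketch of the lazy-hash pattern the commit message describes (a hypothetical `Spec` class under stated assumptions; the real change lives in PyTorch's DTensor spec code):

```python
class Spec:
    """Sketch: defer hashing so constructing a spec never hashes its strides."""

    def __init__(self, shape, stride):
        self.shape = tuple(shape)
        self.stride = tuple(stride)
        self._hash = None  # computed lazily, on first use

    def __hash__(self):
        # Hashing a stride tuple containing `SymInt`s raises during Dynamo
        # tracing ("unhashable type: non-singleton SymInt"). Deferring the
        # hash means tracing paths that never hash the spec never hit it.
        if self._hash is None:
            self._hash = hash((self.shape, self.stride))
        return self._hash
```

Deferring `__hash__` this way keeps construction cheap during tracing and avoids touching symbolic strides unless the hash is actually needed.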
@awgu
Copy link
Contributor

awgu commented Nov 22, 2023

Closing this as it should be fixed now.

@awgu closed this as completed Nov 22, 2023
@ezyang added the triaged label Nov 22, 2023