DISABLED test_2d_fsdp_tp_ac_compile (__main__.TestDTensorCompileE2E) #113781

Closed
clee2000 opened this issue Nov 15, 2023 · 4 comments
Labels
  • oncall: distributed (Add this issue/PR to distributed oncall triage queue)
  • oncall: pt2
  • skipped (Denotes a (flaky) test currently skipped in CI.)
  • triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@clee2000
Contributor

clee2000 commented Nov 15, 2023

Platforms: linux

Broken on multigpu

To re-enable the test on your PR, put `Fixes #<this issue number>` in the PR body (see the example below) and add the `ciflow/periodic` tag to trigger the multigpu jobs.
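
For example, the PR body line for this issue (a sketch, using the issue number from the title) would be:

```
Fixes #113781
```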

Probably caused by #113547 or something in its stack. @wanchaol, do you mind providing a forward fix?

First known bad: https://hud.pytorch.org/pytorch/pytorch/commit/93372455a73043332c16a71cb9dccdf3e0412a57
Last known good: https://hud.pytorch.org/pytorch/pytorch/commit/a1e3c501652101e8b37baac62216db7ca22c9923

Example failure: https://github.com/pytorch/pytorch/actions/runs/6863856295/job/18665805628

_______________ TestDTensorCompileE2E.test_2d_fsdp_tp_ac_compile _______________
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 542, in wrapper
    self._join_processes(fn)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 761, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 811, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 2 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
    getattr(self, test_name)()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
    fn()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2575, in wrapper
    method(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 193, in wrapper
    func(self, *args, **kwargs)  # type: ignore[misc]
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
    return func(*args, **kwargs)
  File "/var/lib/jenkins/workspace/test/distributed/_tensor/test_dtensor_compile.py", line 328, in test_2d_fsdp_tp_ac_compile
    compiled_output = compiled_2d(inp)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 408, in _fn
    return fn(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 840, in forward
    args, kwargs = _pre_forward(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 412, in _pre_forward
    unshard_fn(state, handle)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 447, in _pre_forward_unshard
    _unshard(state, handle, state._unshard_stream, state._pre_unshard_stream)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 331, in _unshard
    handle.unshard()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 1272, in unshard
    self._use_unsharded_flat_param(padded_unsharded_flat_param)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 1404, in _use_unsharded_flat_param
    self._use_unsharded_views(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 1847, in _use_unsharded_views
    views = self._get_unflat_views()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 1824, in _get_unflat_views_aligned
    _ext_post_unflatten_transform(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_fsdp_extensions.py", line 113, in _ext_post_unflatten_transform
    return fsdp_extension.post_unflatten_transform(tensor, param_extension)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/fsdp.py", line 334, in post_unflatten_transform
    result = _unflatten_tensor(tensor, param_extension)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 569, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 671, in _convert_frame
    result = inner_convert(frame, cache_entry, hooks, frame_state)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 377, in _convert_frame_assert
    return _compile(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 614, in _compile
    raise InternalTorchDynamoError(str(e)).with_traceback(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 595, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 243, in time_wrapper
    r = func(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 512, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
    transformations(instructions, code_options)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 150, in _fn
    return fn(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 477, in transform
    tracer.run()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2120, in run
    super().run()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 815, in run
    and self.step()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 778, in step
    getattr(self, inst.opname)(inst)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 469, in wrapper
    return inner_fn(self, inst)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1259, in CALL_FUNCTION_KW
    self.call_function(fn, args, kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 650, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/torch.py", line 572, in call_function
    kwargs_as_value = {k: v.as_python_constant() for k, v in kwargs.items()}
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/torch.py", line 572, in <dictcomp>
    kwargs_as_value = {k: v.as_python_constant() for k, v in kwargs.items()}
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/lists.py", line 66, in as_python_constant
    return self.python_type()([x.as_python_constant() for x in self.items])
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/lists.py", line 66, in <listcomp>
    return self.python_type()([x.as_python_constant() for x in self.items])
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/base.py", line 238, in as_python_constant
    raise NotImplementedError(f"{self} is not a constant")
torch._dynamo.exc.InternalTorchDynamoError: SymNodeVariable() is not a constant

from user code:
   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/_data_parallel_utils.py", line 18, in _unflatten_tensor
    result = DistributedTensor.from_local(

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True


To execute this test, run the following from the base repo dir:
     python test/distributed/_tensor/test_dtensor_compile.py -k test_2d_fsdp_tp_ac_compile

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0


----------------------------- Captured stdout call -----------------------------
Process 2 terminated with exit code 10, terminating remaining processes.
------------------------------ Captured log call -------------------------------
INFO     numba.cuda.cudadrv.driver:driver.py:245 init
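
For context, the `SymNodeVariable() is not a constant` error comes from Dynamo's constant-folding path visible in the traceback (`as_python_constant` in `variables/base.py`): the `DistributedTensor.from_local` call in the user-code frame is traced with keyword arguments containing symbolic shapes/strides, which cannot be folded into Python constants. A simplified sketch of that pattern (class names taken from the traceback for illustration; not Dynamo's actual implementation):

```python
class VariableTracker:
    """Simplified stand-in for torch._dynamo.variables.base.VariableTracker."""

    def as_python_constant(self):
        # Mirrors the raise in the traceback above: a traced value that cannot
        # be folded into a concrete Python constant fails here.
        raise NotImplementedError(f"{self} is not a constant")


class ConstantVariable(VariableTracker):
    """A value whose concrete Python value is known at trace time."""

    def __init__(self, value):
        self.value = value

    def as_python_constant(self):
        return self.value


class SymNodeVariable(VariableTracker):
    """A symbolic value (e.g., a SymInt shape or stride) with no concrete
    value at trace time. It inherits the base-class raise, which is why
    folding the from_local kwargs to constants fails in the log above."""
```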

This test was disabled because it is failing on the main branch (recent examples).

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @ezyang @msaroufim @wconstab @bdhirsh @anijain2305 @zou3519

@pytorch-bot bot added the skipped label (Denotes a (flaky) test currently skipped in CI.) Nov 15, 2023

pytorch-bot bot commented Nov 15, 2023

Hello there! From the DISABLED prefix in this issue title, it looks like you are attempting to disable a test in PyTorch CI. The information I have parsed is below:
  • Test name: test_2d_fsdp_tp_ac_compile (__main__.TestDTensorCompileE2E)
  • Platforms for which to skip the test: linux
  • Disabled by clee2000

Within ~15 minutes, test_2d_fsdp_tp_ac_compile (__main__.TestDTensorCompileE2E) will be disabled in PyTorch CI for these platforms: linux. Please verify that your test name looks correct, e.g., test_cuda_assert_async (__main__.TestCuda).

To modify the platforms list, please include a line in the issue body, like below. The default action will disable the test for all platforms if no platforms list is specified.

Platforms: case-insensitive, list, of, platforms

We currently support the following platforms: asan, dynamo, inductor, linux, mac, macos, rocm, slow, win, windows.
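
For example, a hypothetical issue body line that limits the skip to two of those platforms:

```
Platforms: linux, rocm
```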

@huydhn
Contributor

huydhn commented Nov 21, 2023

Reopening this issue because the test started failing on multigpu after https://hud.pytorch.org/pytorch/pytorch/commit/3e49621f3b4652b8e7782aa8dafb28f9d985598b. The error looks relevant (`unhashable type: non-singleton SymInt`), so I think a forward fix is needed.

cc @awgu

@huydhn reopened this Nov 21, 2023
@bdhirsh added the oncall: distributed and oncall: pt2 labels Nov 21, 2023
@awgu
Contributor

awgu commented Nov 22, 2023

@huydhn I am working on a fix!

pytorchmergebot pushed a commit that referenced this issue Nov 22, 2023
This is a forward fix for #113781.

We lazily compute the hash so that we do not try to compute the hash on `SymInt`s (for the stride) during Dynamo tracing.

Tested via:
```
python test/distributed/_tensor/test_dtensor_compile.py -k test_2d_fsdp_tp_ac_compile
```
Pull Request resolved: #114322
Approved by: https://github.com/wanchaol
ghstack dependencies: #113919, #113924, #114134, #113925, #113930, #114141, #113915, #114140
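
For context, a minimal sketch of the lazy-hash pattern the commit message describes (a hypothetical `Spec` class under stated assumptions; the real change lives in PyTorch's DTensor spec code):

```python
class Spec:
    """Sketch: defer hashing so constructing a spec never hashes its strides."""

    def __init__(self, shape, stride):
        self.shape = tuple(shape)
        self.stride = tuple(stride)
        self._hash = None  # computed lazily, on first use

    def __hash__(self):
        # Hashing a stride tuple containing `SymInt`s raises during Dynamo
        # tracing ("unhashable type: non-singleton SymInt"). Deferring the
        # hash means tracing paths that never hash the spec never hit it.
        if self._hash is None:
            self._hash = hash((self.shape, self.stride))
        return self._hash
```

Deferring `__hash__` this way keeps construction cheap during tracing and avoids touching symbolic strides unless the hash is actually needed.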
@awgu
Copy link
Contributor

awgu commented Nov 22, 2023

Closing this as it should be fixed now.

@awgu closed this as completed Nov 22, 2023
@ezyang added the triaged label Nov 22, 2023