Closed
Description
🐛 Bug
test/spmd/test_spmd_debugging.py fails when run on a v4-8 TPU, with the following output:
root@8a626de6d6ae:/ansible/pytorch/xla# python3 -u test/spmd/test_spmd_debugging.py
s
TPU 0 TPU 4 TPU 8 TPU 12 TPU 2 TPU 6 TPU 10 TPU 14
TPU 1 TPU 5 TPU 9 TPU 13 TPU 3 TPU 7 TPU 11 TPU 15
Fs
TPU 0 TPU 1
TPU 2 TPU 3
Fs
TPU [0, 1]
TPU [4, 5]
TPU [8, 9]
TPU [12, 13]
TPU [2, 3]
TPU [6, 7]
TPU [10, 11]
TPU [14, 15]
Fs
TPU [0, 1, 2, 3]
Fs
TPU [0, 1]
TPU [2, 3]
Fs
TPU [0, 1, 2, 3]
F
======================================================================
FAIL: test_debugging_spmd_multi_host_tiled_tpu (__main__.DebuggingSpmdTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test/spmd/test_spmd_debugging.py", line 454, in test_debugging_spmd_multi_host_tiled_tpu
assert output == fake_output
AssertionError
======================================================================
FAIL: test_debugging_spmd_single_host_tiled_tpu (__main__.DebuggingSpmdTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test/spmd/test_spmd_debugging.py", line 109, in test_debugging_spmd_single_host_tiled_tpu
assert output == fake_output
AssertionError
======================================================================
FAIL: test_multi_host_partial_replication_tpu (__main__.DebuggingSpmdTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test/spmd/test_spmd_debugging.py", line 538, in test_multi_host_partial_replication_tpu
assert output == fake_output
AssertionError
======================================================================
FAIL: test_multi_host_replicated_tpu (__main__.DebuggingSpmdTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test/spmd/test_spmd_debugging.py", line 574, in test_multi_host_replicated_tpu
assert output == fake_output
AssertionError
======================================================================
FAIL: test_single_host_partial_replication_tpu (__main__.DebuggingSpmdTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test/spmd/test_spmd_debugging.py", line 160, in test_single_host_partial_replication_tpu
assert output == fake_output
AssertionError
======================================================================
FAIL: test_single_host_replicated_tpu (__main__.DebuggingSpmdTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test/spmd/test_spmd_debugging.py", line 205, in test_single_host_replicated_tpu
assert output == fake_output
AssertionError
----------------------------------------------------------------------
Ran 12 tests in 3.288s
FAILED (failures=6, skipped=6)
To confirm my setup was correct, I ran test/test_operations.py beforehand; it passed, skipping the individual tests that are not applicable to TPUs.
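The bare `assert output == fake_output` in the tracebacks above does not show how the two values differ. A minimal sketch for narrowing that down, assuming both values are plain strings: `explain_mismatch` is a hypothetical helper, not part of test/spmd/test_spmd_debugging.py, and the sample strings are made up for illustration rather than taken from the test.

```python
import difflib


def explain_mismatch(actual: str, expected: str) -> str:
    """Return a unified diff of expected vs. actual (empty string if equal).

    Hypothetical helper: the names mirror `output` and `fake_output`
    from the failing assertions, but this is not code from the test file.
    """
    diff = difflib.unified_diff(
        expected.splitlines(),
        actual.splitlines(),
        fromfile="fake_output",
        tofile="output",
        lineterm="",
    )
    return "\n".join(diff)


# Illustrative values only; the real strings come from the failing test.
sample_expected = "TPU 0 TPU 1\nTPU 2 TPU 3"
sample_actual = "TPU 0 TPU 2\nTPU 1 TPU 3"
print(explain_mismatch(sample_actual, sample_expected))
```

Temporarily printing `explain_mismatch(output, fake_output)` just before one of the failing assertions would show whether the difference is in device numbering, layout, or whitespace.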
To Reproduce
Steps to reproduce the behavior:
- Created a v4-8 TPU and SSH'd into it.
- Followed setup steps here, and ran `export BUNDLE_LIBTPU=1; export TPUVM_MODE=1` before running the pytorch/xla setup script.
- Ran `export PJRT_DEVICE=TPU; python3 -u test/test_operations.py -v` to ensure my setup was working.
- Ran `python3 -u test/spmd/test_spmd_debugging.py`, which resulted in the above failure.
Expected behavior
Expected the test to pass.
Environment
- Reproducible on XLA backend [CPU/TPU]: TPU
- torch_xla version: cloned master & ran setup
Additional context