test/spmd/test_spmd_debugging.py fails when run on a v4-8 TPU #6252

@mbzomowski

🐛 Bug

test/spmd/test_spmd_debugging.py fails when run on a v4-8 TPU, with the following output:

root@8a626de6d6ae:/ansible/pytorch/xla# python3 -u test/spmd/test_spmd_debugging.py
s                                                           
 TPU 0  TPU 4  TPU 8  TPU 12  TPU 2  TPU 6  TPU 10  TPU 14 
                                                           
                                                           
 TPU 1  TPU 5  TPU 9  TPU 13  TPU 3  TPU 7  TPU 11  TPU 15 
                                                           
Fs              
 TPU 0  TPU 1 
              
              
 TPU 2  TPU 3 
              
Fs              
  TPU [0, 1]  
              
              
  TPU [4, 5]  
              
              
  TPU [8, 9]  
              
              
 TPU [12, 13] 
              
              
  TPU [2, 3]  
              
              
  TPU [6, 7]  
              
              
 TPU [10, 11] 
              
              
 TPU [14, 15] 
              
Fs                  
 TPU [0, 1, 2, 3] 
                  
Fs            
 TPU [0, 1] 
            
            
 TPU [2, 3] 
            
Fs                  
 TPU [0, 1, 2, 3] 
                  
F
======================================================================
FAIL: test_debugging_spmd_multi_host_tiled_tpu (__main__.DebuggingSpmdTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/spmd/test_spmd_debugging.py", line 454, in test_debugging_spmd_multi_host_tiled_tpu
    assert output == fake_output
AssertionError

======================================================================
FAIL: test_debugging_spmd_single_host_tiled_tpu (__main__.DebuggingSpmdTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/spmd/test_spmd_debugging.py", line 109, in test_debugging_spmd_single_host_tiled_tpu
    assert output == fake_output
AssertionError

======================================================================
FAIL: test_multi_host_partial_replication_tpu (__main__.DebuggingSpmdTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/spmd/test_spmd_debugging.py", line 538, in test_multi_host_partial_replication_tpu
    assert output == fake_output
AssertionError

======================================================================
FAIL: test_multi_host_replicated_tpu (__main__.DebuggingSpmdTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/spmd/test_spmd_debugging.py", line 574, in test_multi_host_replicated_tpu
    assert output == fake_output
AssertionError

======================================================================
FAIL: test_single_host_partial_replication_tpu (__main__.DebuggingSpmdTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/spmd/test_spmd_debugging.py", line 160, in test_single_host_partial_replication_tpu
    assert output == fake_output
AssertionError

======================================================================
FAIL: test_single_host_replicated_tpu (__main__.DebuggingSpmdTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/spmd/test_spmd_debugging.py", line 205, in test_single_host_replicated_tpu
    assert output == fake_output
AssertionError

----------------------------------------------------------------------
Ran 12 tests in 3.288s

FAILED (failures=6, skipped=6)

To confirm that my setup was correct, I ran test/test_operations.py beforehand. It passed, skipping the individual tests that are not applicable to TPUs.
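Each failure above is a bare AssertionError from assert output == fake_output, so the log does not show how the rendered table differs from the expected one. A minimal sketch of a comparison helper that would surface the difference, using only the standard library; assert_same_text is a hypothetical name and is not part of the test file or of torch_xla:

```python
import difflib


def assert_same_text(actual: str, expected: str) -> None:
    """Raise AssertionError carrying a unified diff when the strings differ.

    Hypothetical helper, not part of test_spmd_debugging.py. Lines are passed
    through repr() so that trailing spaces in the rendered tables are visible.
    """
    if actual == expected:
        return
    expected_lines = [repr(line) for line in expected.splitlines()]
    actual_lines = [repr(line) for line in actual.splitlines()]
    diff = difflib.unified_diff(
        expected_lines,
        actual_lines,
        fromfile="fake_output (expected)",
        tofile="output (actual)",
        lineterm="",
    )
    raise AssertionError("rendered table differs:\n" + "\n".join(diff))
```

Replacing assert output == fake_output with assert_same_text(output, fake_output) in a local copy of a failing test should show which lines of the table differ, including whitespace-only differences.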

To Reproduce

Steps to reproduce the behavior:

  1. Created a v4-8 TPU and SSH'd into it.
  2. Followed the setup steps here, and ran export BUNDLE_LIBTPU=1; export TPUVM_MODE=1 before running the pytorch/xla setup script.
  3. Ran export PJRT_DEVICE=TPU; python3 -u test/test_operations.py -v to confirm that my setup was working.
  4. Ran python3 -u test/spmd/test_spmd_debugging.py, which produced the failure above.
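Step 4 can also be narrowed to the six failing tests. A rough sketch, under two assumptions: the test file ends in a standard unittest.main() call (which accepts test names as positional arguments), and PJRT_DEVICE=TPU from step 3 is still set when the tests run. The helper names below are hypothetical, and the script is meant to be run from the root of the pytorch/xla checkout:

```python
import os
import subprocess
import sys

TEST_FILE = "test/spmd/test_spmd_debugging.py"

# The six tests reported as FAIL in the output above.
FAILING_TESTS = [
    "DebuggingSpmdTest.test_debugging_spmd_multi_host_tiled_tpu",
    "DebuggingSpmdTest.test_debugging_spmd_single_host_tiled_tpu",
    "DebuggingSpmdTest.test_multi_host_partial_replication_tpu",
    "DebuggingSpmdTest.test_multi_host_replicated_tpu",
    "DebuggingSpmdTest.test_single_host_partial_replication_tpu",
    "DebuggingSpmdTest.test_single_host_replicated_tpu",
]


def build_env(base=None):
    """Copy the environment and set PJRT_DEVICE=TPU, as in step 3."""
    env = dict(os.environ if base is None else base)
    env["PJRT_DEVICE"] = "TPU"
    return env


def build_command(test_names=()):
    """Build the step 4 command, optionally restricted to the named tests."""
    return [sys.executable, "-u", TEST_FILE, *test_names]


def main() -> int:
    """Run the failing tests and return the exit code of the test process."""
    result = subprocess.run(build_command(FAILING_TESTS), env=build_env())
    return result.returncode
```

Calling raise SystemExit(main()) at the bottom of the script runs only those six tests; build_command() with no arguments reproduces the full run from step 4.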

Expected behavior

Expected every test in test/spmd/test_spmd_debugging.py that is not skipped to pass.

Environment

  • Reproducible on XLA backend [CPU/TPU]: TPU
  • torch_xla version: built from source (cloned master and ran the setup script)

Additional context
