
[CI Failure][AMD]: test_prefix_prefill failing on AMD with numeric issues #28490

@zhewenl

Description

Name of failing test

tests/kernels/attention/test_prefix_prefill.py

Basic information

  • Flaky test
  • Can reproduce locally
  • Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test

In #28424 we refactored the test to use PyTorch SDPA instead of xformers, since xformers is not supported on ROCm, and fixed the following incompatibilities on AMD:
a. The ROCm paged attention kernel expects 32-bit int tensors, but the test passed 64-bit torch.long tensors.
b. The ROCm paged attention kernel only supports the auto, fp8, and fp8_e4m3 KV cache dtypes.
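
For readers unfamiliar with point (b), here is a minimal sketch of how a test parametrization could be narrowed to the KV cache dtypes listed above when running on ROCm. The helper names and the `on_rocm` flag are hypothetical and are not taken from the vLLM test; only the three dtype strings come from this issue. (For point (a), the usual fix is an explicit cast such as `tensor.to(torch.int32)`.)

```python
# Hypothetical helpers for illustration; not the code from #28424.
# The only fact taken from the issue is the set of supported dtype strings.
ROCM_PAGED_ATTN_KV_CACHE_DTYPES = frozenset({"auto", "fp8", "fp8_e4m3"})


def rocm_supports_kv_cache_dtype(kv_cache_dtype: str) -> bool:
    """Return True if the dtype is in the list quoted in the issue."""
    return kv_cache_dtype in ROCM_PAGED_ATTN_KV_CACHE_DTYPES


def filter_kv_cache_dtypes(candidates, on_rocm: bool):
    """Keep every candidate off ROCm; keep only supported ones on ROCm."""
    if not on_rocm:
        return list(candidates)
    return [d for d in candidates if rocm_supports_kv_cache_dtype(d)]


if __name__ == "__main__":
    # "fp8_e5m2" is used here only as an example of a dtype outside the list.
    print(filter_kv_cache_dtypes(["auto", "fp8", "fp8_e5m2"], on_rocm=True))
```

In a pytest suite the same check would more likely appear as a `pytest.skip(...)` call at the top of the test body, so unsupported combinations show up as skipped rather than silently dropped.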

However, the test is still failing on MI300 with numeric issues: https://gist.github.com/zhewenl/3224057e57aad300341c8a0d66bd9878
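
When triaging an "are not close" failure it can help to know how far off the values are. The sketch below is a plain-Python illustration of the elementwise criterion that `torch.testing.assert_close` documents, `|actual - expected| <= atol + rtol * |expected|`. The function name and the sample values and tolerances are made up for illustration; the tolerances actually used by the test are not stated in this issue.

```python
def mismatch_report(actual, expected, rtol, atol):
    """Summarize how two equal-length sequences of floats differ.

    An element counts as mismatched when
    |actual - expected| > atol + rtol * |expected|.
    """
    mismatched = 0
    max_abs = 0.0
    max_rel = 0.0
    for a, e in zip(actual, expected):
        abs_err = abs(a - e)
        if abs_err > atol + rtol * abs(e):
            mismatched += 1
        max_abs = max(max_abs, abs_err)
        if e != 0:
            # Relative error is undefined where the reference is zero.
            max_rel = max(max_rel, abs_err / abs(e))
    return {"mismatched": mismatched, "max_abs": max_abs, "max_rel": max_rel}


if __name__ == "__main__":
    # Illustrative numbers only, not values from the failing test.
    print(mismatch_report([1.0, 2.0, 3.1], [1.0, 2.0, 3.0], rtol=1e-3, atol=1e-3))
```

A small maximum error with many mismatched elements tends to point at tolerance or accumulation-order differences, while a few very large errors more often indicate an indexing or masking bug.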

pytest -s -v 'tests/kernels/attention/test_prefix_prefill.py'
...
================================================================= short test summary info ==================================================================
FAILED tests/kernels/attention/test_prefix_prefill.py::test_contexted_kv_attention[chunked_prefill_paged_decode-0-cuda:0-auto-dtype0-128-1-64] - AssertionError: Tensor-likes are not close!
FAILED tests/kernels/attention/test_prefix_prefill.py::test_contexted_kv_attention[chunked_prefill_paged_decode-0-cuda:0-fp8-dtype0-128-1-64] - AssertionError: Tensor-likes are not close!
FAILED tests/kernels/attention/test_prefix_prefill.py::test_contexted_kv_attention[chunked_prefill_paged_decode-0-cuda:1-auto-dtype0-128-1-64] - AssertionError: Tensor-likes are not close!
FAILED tests/kernels/attention/test_prefix_prefill.py::test_contexted_kv_attention[chunked_prefill_paged_decode-0-cuda:1-fp8-dtype0-128-1-64] - AssertionError: Tensor-likes are not close!

Results (299.93s (0:04:59)):
     156 passed
       4 failed
         - tests/kernels/attention/test_prefix_prefill.py:101 test_contexted_kv_attention[chunked_prefill_paged_decode-0-cuda:0-auto-dtype0-128-1-64]
         - tests/kernels/attention/test_prefix_prefill.py:101 test_contexted_kv_attention[chunked_prefill_paged_decode-0-cuda:0-fp8-dtype0-128-1-64]
         - tests/kernels/attention/test_prefix_prefill.py:101 test_contexted_kv_attention[chunked_prefill_paged_decode-0-cuda:1-auto-dtype0-128-1-64]
         - tests/kernels/attention/test_prefix_prefill.py:101 test_contexted_kv_attention[chunked_prefill_paged_decode-0-cuda:1-fp8-dtype0-128-1-64]
     224 skipped
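
The failing IDs above share most of their parameters. As a rough triage aid, the following sketch parses the `FAILED` lines and reports which positions in the parameter ID actually vary. The function names are hypothetical, and splitting on `-` is a heuristic that only works because none of these parameter values contain a hyphen.

```python
import re

# The four FAILED lines from the short test summary above.
FAILED_LINES = """\
FAILED tests/kernels/attention/test_prefix_prefill.py::test_contexted_kv_attention[chunked_prefill_paged_decode-0-cuda:0-auto-dtype0-128-1-64] - AssertionError: Tensor-likes are not close!
FAILED tests/kernels/attention/test_prefix_prefill.py::test_contexted_kv_attention[chunked_prefill_paged_decode-0-cuda:0-fp8-dtype0-128-1-64] - AssertionError: Tensor-likes are not close!
FAILED tests/kernels/attention/test_prefix_prefill.py::test_contexted_kv_attention[chunked_prefill_paged_decode-0-cuda:1-auto-dtype0-128-1-64] - AssertionError: Tensor-likes are not close!
FAILED tests/kernels/attention/test_prefix_prefill.py::test_contexted_kv_attention[chunked_prefill_paged_decode-0-cuda:1-fp8-dtype0-128-1-64] - AssertionError: Tensor-likes are not close!
"""


def param_ids(summary: str) -> list[list[str]]:
    """Extract the bracketed parameter ID of each FAILED line, split on '-'."""
    ids = re.findall(r"^FAILED \S+::\w+\[([^\]]+)\]", summary, flags=re.MULTILINE)
    return [test_id.split("-") for test_id in ids]


def varying_positions(rows: list[list[str]]) -> dict[int, list[str]]:
    """Map each position whose value differs across rows to its distinct values."""
    return {
        pos: sorted(set(column))
        for pos, column in enumerate(zip(*rows))
        if len(set(column)) > 1
    }


if __name__ == "__main__":
    print(varying_positions(param_ids(FAILED_LINES)))
    # prints {2: ['cuda:0', 'cuda:1'], 3: ['auto', 'fp8']}
```

Only the device and the KV cache dtype vary, so the failures are confined to a single combination of the remaining parameters (`chunked_prefill_paged_decode-0-…-dtype0-128-1-64`) and reproduce on both devices and both KV cache dtypes.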

📝 History of failing test

CI

CC List.

No response

Labels: ci-failure (Issue about an unexpected test failure in CI), rocm (Related to AMD ROCm)

Status: Done
