[cuBLAS] Gemm tests using half can fail #599

Open
@Rbiessy

Summary

cuBLAS tests running Gemm with half precision can fail with wrong results on A100.

Version

Using the tip of develop as of today (6923d40).

Environment

NVIDIA A100 with DPC++ release 2024.2.0 and the associated Codeplay NVIDIA plugin. The CUDA version is 12.6.2 and the OS is Ubuntu 22.04.

Steps to reproduce

cmake -Bbuild-a100 -GNinja -DCMAKE_CXX_COMPILER=`which icpx` -DENABLE_MKLCPU_BACKEND=OFF -DENABLE_MKLGPU_BACKEND=OFF -DENABLE_CUBLAS_BACKEND=ON -DENABLE_CURAND_BACKEND=ON -DENABLE_CUSOLVER_BACKEND=ON -DENABLE_CUFFT_BACKEND=ON -DREF_BLAS_ROOT=/path/to/lapack/install -DREF_LAPACK_ROOT=/path/to/lapack/install .
cd build-a100
ninja
ctest -R ".*GemmUsmTests.*Half.*" --output-on-failure

Observed behavior

Full log: log_a100.txt
Short extract:

[ RUN      ] GemmUsmTestSuite/GemmUsmTests.HalfHalfFloatPrecision/Column_Major_NVIDIA_A100_PCIE_40GB
relative error = 0.496206 absolute error = 0.382722 limit = 0.00010848
Difference in entry (0,0): DPC++ 0.388574 vs. Reference 0.771296
relative error = 1.36303 absolute error = 1.67412 limit = 0.00010848
Difference in entry (1,0): DPC++ 0.445891 vs. Reference -1.22823
relative error = 1.05343 absolute error = 0.664006 limit = 0.00010848
Difference in entry (2,0): DPC++ 0.0336805 vs. Reference -0.630325
relative error = 1.0674 absolute error = 0.514821 limit = 0.00010848
Difference in entry (3,0): DPC++ 0.0325077 vs. Reference -0.482313
relative error = 0.789876 absolute error = 0.992507 limit = 0.00010848
Difference in entry (4,0): DPC++ -0.264029 vs. Reference -1.25654
relative error = 0.925093 absolute error = 1.07784 limit = 0.00010848

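The differences between the output and the reference seem far too large to be explained by half-precision rounding. The relative errors in the extract range from about 0.5 to 1.4 against a limit of about 1.1e-4, and entries (1,0), (2,0) and (3,0) have the opposite sign to the reference.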
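To put numbers on that observation, here is a small sketch that parses the short extract above and reports how far each entry is from the limit. It is not part of the test suite: the names `summarize`, `ERR_RE` and `DIFF_RE` are hypothetical, and the script assumes each "relative error" line belongs to the "Difference in entry" line that follows it, which is consistent with the logged values. Since the extract does not say whether the test compares the relative or the absolute error against the limit, the sketch conservatively uses the smaller of the two.

```python
import re

# Lines copied verbatim from the short extract in this report.
LOG = """\
relative error = 0.496206 absolute error = 0.382722 limit = 0.00010848
Difference in entry (0,0): DPC++ 0.388574 vs. Reference 0.771296
relative error = 1.36303 absolute error = 1.67412 limit = 0.00010848
Difference in entry (1,0): DPC++ 0.445891 vs. Reference -1.22823
relative error = 1.05343 absolute error = 0.664006 limit = 0.00010848
Difference in entry (2,0): DPC++ 0.0336805 vs. Reference -0.630325
relative error = 1.0674 absolute error = 0.514821 limit = 0.00010848
Difference in entry (3,0): DPC++ 0.0325077 vs. Reference -0.482313
relative error = 0.789876 absolute error = 0.992507 limit = 0.00010848
Difference in entry (4,0): DPC++ -0.264029 vs. Reference -1.25654
"""

NUM = r"(-?\d+(?:\.\d+)?)"
ERR_RE = re.compile(
    rf"relative error = {NUM} absolute error = {NUM} limit = {NUM}")
DIFF_RE = re.compile(
    rf"Difference in entry \((\d+),(\d+)\): DPC\+\+ {NUM} vs\. Reference {NUM}")


def summarize(log):
    """Pair each error line with the entry line that follows it."""
    rows = []
    pending = None  # (relative error, absolute error, limit) awaiting its entry
    for line in log.splitlines():
        m = ERR_RE.search(line)
        if m:
            pending = tuple(float(x) for x in m.groups())
            continue
        m = DIFF_RE.search(line)
        if m and pending is not None:
            rel, abs_logged, limit = pending
            got, ref = float(m.group(3)), float(m.group(4))
            rows.append({
                "entry": (int(m.group(1)), int(m.group(2))),
                "abs_logged": abs_logged,
                "abs_recomputed": abs(got - ref),
                # Assumption: use the smaller error, whichever one the test checks.
                "error_over_limit": min(rel, abs_logged) / limit,
                "sign_flip": got * ref < 0,
            })
            pending = None
    return rows


if __name__ == "__main__":
    for row in summarize(LOG):
        print(row["entry"],
              f"error/limit = {row['error_over_limit']:.0f}",
              "sign flipped" if row["sign_flip"] else "same sign")
```

On this extract, every entry exceeds the limit by a factor of more than 3000, and three of the five complete entries have the wrong sign. The same function can be pointed at the full log to check whether the pattern holds beyond column 0.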

Expected behavior

The tests should pass.

Labels: BLAS domain, bug
