Description
Summary
The cuBLAS backend's GEMM tests with half-precision inputs can fail with wrong results on an NVIDIA A100.
Version
Using the tip of the develop branch at the time of writing (commit 6923d40).
Environment
Using an A100 with the DPC++ 2024.2.0 release and the matching Codeplay NVIDIA plugin. The CUDA version is 12.6.2 and the OS is Ubuntu 22.04.
Steps to reproduce
cmake -Bbuild-a100 -GNinja -DCMAKE_CXX_COMPILER=`which icpx` -DENABLE_MKLCPU_BACKEND=OFF -DENABLE_MKLGPU_BACKEND=OFF -DENABLE_CUBLAS_BACKEND=ON -DENABLE_CURAND_BACKEND=ON -DENABLE_CUSOLVER_BACKEND=ON -DENABLE_CUFFT_BACKEND=ON -DREF_BLAS_ROOT=/path/to/lapack/install -DREF_LAPACK_ROOT=/path/to/lapack/install .
cd build-a100
ninja
ctest -R ".*GemmUsmTests.*Half.*" --output-on-failure
Observed behavior
Full log: log_a100.txt
Short extract:
[ RUN ] GemmUsmTestSuite/GemmUsmTests.HalfHalfFloatPrecision/Column_Major_NVIDIA_A100_PCIE_40GB
relative error = 0.496206 absolute error = 0.382722 limit = 0.00010848
Difference in entry (0,0): DPC++ 0.388574 vs. Reference 0.771296
relative error = 1.36303 absolute error = 1.67412 limit = 0.00010848
Difference in entry (1,0): DPC++ 0.445891 vs. Reference -1.22823
relative error = 1.05343 absolute error = 0.664006 limit = 0.00010848
Difference in entry (2,0): DPC++ 0.0336805 vs. Reference -0.630325
relative error = 1.0674 absolute error = 0.514821 limit = 0.00010848
Difference in entry (3,0): DPC++ 0.0325077 vs. Reference -0.482313
relative error = 0.789876 absolute error = 0.992507 limit = 0.00010848
Difference in entry (4,0): DPC++ -0.264029 vs. Reference -1.25654
relative error = 0.925093 absolute error = 1.07784 limit = 0.00010848
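The differences between the output and the reference are too large to be explained by reduced precision. The relative errors range from about 0.5 to 1.4, while the limit is about 1.1e-4, so every failing entry is off by more than three orders of magnitude. Several entries even have the wrong sign, e.g. entry (1,0): DPC++ 0.445891 vs. Reference -1.22823.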
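To double-check this, the errors can be recomputed from the values in the short log extract. The sketch below assumes the test reports relative error as |DPC++ - Reference| / |Reference|; this formula is inferred from the logged numbers (it reproduces them), not taken from the test source. It also flags entries where the sign differs, which rounding alone cannot produce.

```python
from collections import namedtuple

# Tolerance printed in the log extract.
LIMIT = 0.00010848

# (row, DPC++ result, reference result) copied from the log extract, column 0.
ENTRIES = [
    (0, 0.388574, 0.771296),
    (1, 0.445891, -1.22823),
    (2, 0.0336805, -0.630325),
    (3, 0.0325077, -0.482313),
    (4, -0.264029, -1.25654),
]

Result = namedtuple("Result", "row abs_err rel_err times_limit sign_flip")


def errors(dpcpp, ref):
    """Return (absolute, relative) error for one entry.

    Assumption: relative error = |dpcpp - ref| / |ref|, inferred from the log.
    """
    abs_err = abs(dpcpp - ref)
    return abs_err, abs_err / abs(ref)


def summarize(entries, limit=LIMIT):
    """Compute the errors, how far they exceed the limit, and sign mismatches."""
    results = []
    for row, dpcpp, ref in entries:
        abs_err, rel_err = errors(dpcpp, ref)
        sign_flip = (dpcpp < 0) != (ref < 0)
        results.append(Result(row, abs_err, rel_err, rel_err / limit, sign_flip))
    return results


if __name__ == "__main__":
    for r in summarize(ENTRIES):
        print(f"({r.row},0): abs {r.abs_err:.6g}  rel {r.rel_err:.6g}  "
              f"~{r.times_limit:.0f}x limit  sign flip: {r.sign_flip}")
```

The recomputed values match the logged ones to the printed precision, and entries (1,0) to (3,0) come out with the opposite sign to the reference.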
Expected behavior
The tests should pass.