performance of hipblasHgemm #534
Hi @mathbird, you're correct in reading that the theoretical peak FP16 performance for the MI250X is ~383 TFLOPs. This comes from the following calculation:

hipblasHgemm() will not be able to reach this performance, but you should get substantially better performance if you use the mixed-precision hipblasGemmEx(...) function. With FP16 input/output and FP32 compute, you should see performance more in line with what is expected. An example call would be as follows:
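A sketch of such a call, assuming FP16 A/B/C with FP32 accumulation; the handle, device buffers, and dimensions are placeholders, and the exact signature should be checked against the hipblas.h shipped with your ROCm version:

```c
// Illustrative only: FP16 in/out, FP32 accumulate (HPA).
// Handle creation, allocation, and error checking are omitted.
float alpha = 1.0f, beta = 0.0f;  // alpha/beta match the compute type (FP32)
hipblasGemmEx(handle,
              HIPBLAS_OP_N, HIPBLAS_OP_N,
              m, n, k,
              &alpha,
              d_A, HIPBLAS_R_16F, lda,
              d_B, HIPBLAS_R_16F, ldb,
              &beta,
              d_C, HIPBLAS_R_16F, ldc,
              HIPBLAS_R_32F,            // computeType: FP32 accumulation
              HIPBLAS_GEMM_DEFAULT);
```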
You can take a look at the gemmEx documentation in hipblas.h, or feel free to ask any questions you have and I'll be happy to help. Thanks
Thanks, Daine. I did observe a big performance improvement with GemmEx! But why is the computeType "fp32_type" used for hgemm? If I change it back to fp16_type, the performance drops to 38 TFLOPs again. Can SGEMM and CGEMM do mixed-precision calculation too? Do you have a complete and detailed table explaining which type combinations can be used together for A/B/C, like cublasGemmEx?
Thanks for the question.
Following Daine's comments, there are two categories of gemm functions: high-precision-accumulate (HPA) functions, where the compute type is different from (and more precise than) the data type, and non-HPA functions, where all data types are the same. For the fp16 data type (hgemm), where the input and output data types are fp16, you have two options:
No. The HPA functions are available only when the input data type is fp16/bf16/int8.
Yes, please refer to the 3.7.1 rocblas-bench section of the updated rocBLAS user guide; I recently added a table with this information.
Thanks for the useful info. If d_A, d_B, and d_C are all allocated as float (32-bit) using hipMalloc, will the following GemmEx call convert d_A and d_B correctly to bf16 numbers and produce the right d_C?
No, you should ensure that each datatype provided matches the data stored in the corresponding matrix. The pointers will just be cast to the specified datatype (see rocblas_gemm_ex_template() and gemm_ex_typecasting() if you're interested in the code for the rocBLAS backend).
@mathbird Do you have any further questions, or should we go ahead and close this issue?
I tried the following type combination. It compiled, but the test reported the error "rocBLAS error: HIPBLAS_STATUS_INVALID_ENUM". Did I miss anything? Or can complex single-precision gemm only use HIPBLAS_C_32F input on MI250?
That error code is incorrect; I see we are missing some type conversions in hipBLAS, and I'll make a PR for that today. However, that type combination isn't supported in rocBLAS or cuBLAS, so once the above is fixed it will just return HIPBLAS_STATUS_NOT_SUPPORTED. What you have is "bfloat16 complex" for A/B and "float complex" for C/compute. For the rocBLAS backend, our only support for HIPBLAS_C_32F in GemmEx is essentially the same as hipblasCgemm(), i.e.
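For illustration, the supported single-precision complex combination would look roughly like this (a hedged sketch, with every type, including the compute type, set to HIPBLAS_C_32F; the handle, buffers, and dimensions are placeholders):

```c
// Illustrative only: the all-C_32F combination, equivalent to hipblasCgemm().
hipblasComplex alpha = {1.0f, 0.0f}, beta = {0.0f, 0.0f};
hipblasGemmEx(handle,
              HIPBLAS_OP_N, HIPBLAS_OP_N,
              m, n, k,
              &alpha,
              d_A, HIPBLAS_C_32F, lda,
              d_B, HIPBLAS_C_32F, ldb,
              &beta,
              d_C, HIPBLAS_C_32F, ldc,
              HIPBLAS_C_32F,            // computeType must also be C_32F
              HIPBLAS_GEMM_DEFAULT);
```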
Thanks. No other questions.
I measured the performance of hipblasHgemm using _Float16 on an MI250. Here is what I got:
N=M=K=8192 on MI250: ~38 TFLOPs. Is that right, or did I miss something?
https://www.amd.com/en/products/server-accelerators/instinct-mi250x states that "Peak Half Precision (FP16) Performance is 383 TFLOPs".