Add support for CUBLAS_COMPUTE_16F for GEMM operations in cudaBLAS #123157

Open
gswxp2 opened this issue Apr 2, 2024 · 2 comments
Labels
enhancement · matrix multiplication · module: cublas · module: cuda · triaged

Comments


gswxp2 commented Apr 2, 2024

🚀 The feature, motivation and pitch

For NVIDIA Ada-architecture GPUs such as the RTX 4090, FP16 GEMM performance differs between FP16 accumulation (330.3 TFLOPS) and FP32 accumulation (165.2 TFLOPS); this can be checked in the documentation. The accumulation mode is controlled by the cublasComputeType argument passed to cuBLAS functions.

The current implementation in PyTorch has an option named torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction (allowFP16ReductionCuBLAS in the ATen context). As the documentation suggests, when it is set to true, FP16 GEMM is allowed to accumulate in FP16.
This option is checked in:

if (!at::globalContext().allowFP16ReductionCuBLAS()) {
  cublas_flags = static_cast<cublasMath_t>(
      cublas_flags | CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION);
}

Even if the user enables allowFP16ReductionCuBLAS, the subsequent call to cublasGemmEx still uses FP32 as the accumulation mode (the second-to-last argument is always CUDA_R_32F):

TORCH_CUDABLAS_CHECK(cublasGemmEx(
    handle,
    opa,
    opb,
    m,
    n,
    k,
    &falpha,
    a,
    CUDA_R_16F,
    lda,
    b,
    CUDA_R_16F,
    ldb,
    &fbeta,
    c,
    CUDA_R_16F,
    ldc,
    CUDA_R_32F,
    CUBLAS_GEMM_DEFAULT_TENSOR_OP));
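
For illustration only, here is a minimal sketch of how this call could select the compute type from the flag. The wrapper name gemm_fp16_ex and the allow_fp16_accumulation parameter are placeholders, not existing ATen symbols, and this is not a proposed patch. One detail to note: cuBLAS expects the alpha/beta scaling factors to match the compute type, so the CUBLAS_COMPUTE_16F path also needs __half scalars.

// Sketch only: choose FP16 vs. FP32 accumulation for an FP16 GEMM.
// gemm_fp16_ex and allow_fp16_accumulation are illustrative names.
#include <cublas_v2.h>
#include <cuda_fp16.h>

cublasStatus_t gemm_fp16_ex(cublasHandle_t handle,
                            cublasOperation_t opa, cublasOperation_t opb,
                            int m, int n, int k,
                            float alpha, const __half* a, int lda,
                            const __half* b, int ldb,
                            float beta, __half* c, int ldc,
                            bool allow_fp16_accumulation) {
  if (allow_fp16_accumulation) {
    // CUBLAS_COMPUTE_16F requires __half scaling factors.
    const __half halpha = __half(alpha);
    const __half hbeta = __half(beta);
    return cublasGemmEx(handle, opa, opb, m, n, k,
                        &halpha, a, CUDA_R_16F, lda,
                        b, CUDA_R_16F, ldb,
                        &hbeta, c, CUDA_R_16F, ldc,
                        CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
  }
  // Current behavior: accumulate in FP32.
  return cublasGemmEx(handle, opa, opb, m, n, k,
                      &alpha, a, CUDA_R_16F, lda,
                      b, CUDA_R_16F, ldb,
                      &beta, c, CUDA_R_16F, ldc,
                      CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}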

Also, for batched GEMM operations, the cuBLAS compute mode is set to CUBLAS_COMPUTE_32F unconditionally, without checking the allowFP16ReductionCuBLAS option:

cublasComputeType_t computeType = CUBLAS_COMPUTE_32F;
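
A corresponding sketch for the batched path (again illustrative only, not a patch) would derive the compute type from the same flag that the existing check above already reads, with the same __half alpha/beta caveat as in the previous sketch:

// Sketch only: condition the batched-GEMM compute type on the existing flag
// instead of hard-coding CUBLAS_COMPUTE_32F. The scaling factors would also
// have to be __half when CUBLAS_COMPUTE_16F is selected.
cublasComputeType_t computeType = at::globalContext().allowFP16ReductionCuBLAS()
    ? CUBLAS_COMPUTE_16F
    : CUBLAS_COMPUTE_32F;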

This feature can roughly double FP16 GEMM throughput on commodity GPUs like the RTX 4090/4080. For users who explicitly enable torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction, we should pass CUBLAS_COMPUTE_16F as the cublasComputeType in these cuBLAS calls for higher performance.

If this is the desired behavior, I'm very willing to write a patch to support this feature :)

Alternatives

No response

Additional context

No response

cc @ptrblck @csarofeen @xwang233

@gswxp2 changed the title from "Add support for CUBLAS_COMPUTE_16F for gemm operations in cudaBLAS" to "Add support for CUBLAS_COMPUTE_16F for GEMM operations in cudaBLAS" on Apr 2, 2024

gswxp2 commented Apr 2, 2024

Demo benchmark code here:

import torch
print(torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction)
x = torch.randn(4096, 4096, device="cuda:0", dtype=torch.float16)
y = torch.randn(4096, 4096, device="cuda:0", dtype=torch.float16)
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
# warmup
for _ in range(1000):
    out = x.mm(y)
start.record()
for _ in range(1000):
    out = x.mm(y)
end.record()
end.synchronize()
time = start.elapsed_time(end)  # total milliseconds for 1000 iterations
# 2*4096^3 FLOPs per matmul / 1e9 gives GFLOPs per matmul; dividing by the
# average milliseconds per iteration (time/1000) yields TFLOP/s.
print(f"TFLOPs {4096*4096*4096*2/1e9/(time/1000)}")

The output on my 4090 machine is:

True
TFLOPs 168.63091825340712

We can see that the allow_fp16_reduced_precision_reduction option is enabled by default, yet the measured throughput matches the FP32-accumulation figure from the documentation (165.2 TFLOPS) rather than the FP16-accumulation figure.

@jbschlosser added the module: cuda, triaged, enhancement, module: cublas, and matrix multiplication labels on Apr 2, 2024
@vadimkantorov (Contributor) commented

Also, regarding controls, I think it would be better to have a non-global way of setting these options for a given GEMM call: either via a with statement or, maybe even better, by also allowing an explicit hints= argument (#52439).
