
Add support for CUBLAS_COMPUTE_16F for GEMM operations in cuBLAS #123157

@gswxp2

Description

🚀 The feature, motivation and pitch

For NVIDIA Ada architecture GPUs such as the RTX 4090, FP16 GEMM throughput differs substantially between FP16 accumulation (330.3 TFLOPS) and FP32 accumulation (165.2 TFLOPS); see NVIDIA's documentation. The accumulation mode is controlled by the cublasComputeType argument when calling cuBLAS functions.

The current implementation in PyTorch has an option named torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction (allowFP16ReductionCuBLAS in the ATen context). As the documentation states, when it is set to true, reduced-precision (FP16) accumulation is allowed for FP16 GEMMs.
This option is checked in:

if (!at::globalContext().allowFP16ReductionCuBLAS()) {
  cublas_flags = static_cast<cublasMath_t>(
      cublas_flags | CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION);
}
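For context, this check only toggles the cuBLAS math mode on the handle; it never changes the compute type passed to the GEMM call itself. A paraphrased sketch of the surrounding code (illustrative, not an exact quote) looks roughly like:

cublasMath_t cublas_flags = CUBLAS_DEFAULT_MATH;
if (!at::globalContext().allowFP16ReductionCuBLAS()) {
  // Forbid cuBLAS from silently reducing the accumulation precision.
  cublas_flags = static_cast<cublasMath_t>(
      cublas_flags | CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION);
}
// The math-mode flags are applied to the handle; the computeType argument of the
// actual GEMM call below is chosen independently and stays FP32.
TORCH_CUDABLAS_CHECK(cublasSetMathMode(handle, cublas_flags));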

Even if the user enables allowFP16ReductionCuBLAS, the subsequent call to cublasGemmEx still uses FP32 as the accumulation (compute) type: the second-to-last argument is always CUDA_R_32F.

TORCH_CUDABLAS_CHECK(cublasGemmEx(
    handle,
    opa,
    opb,
    m,
    n,
    k,
    &falpha,
    a,
    CUDA_R_16F,
    lda,
    b,
    CUDA_R_16F,
    ldb,
    &fbeta,
    c,
    CUDA_R_16F,
    ldc,
    CUDA_R_32F,
    CUBLAS_GEMM_DEFAULT_TENSOR_OP));
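A hypothetical sketch of the requested change (illustrative, not current PyTorch code): select the compute type from the flag. Note that, per the cuBLAS documentation, CUBLAS_COMPUTE_16F requires the scale type to be CUDA_R_16F, so alpha/beta would have to be passed as half-precision values in that case; halpha/hbeta below are assumed conversions of the existing falpha/fbeta.

const bool use_fp16_accum = at::globalContext().allowFP16ReductionCuBLAS();
// alpha/beta must match the compute type: half for CUBLAS_COMPUTE_16F, float for CUBLAS_COMPUTE_32F.
const at::Half halpha = falpha;
const at::Half hbeta = fbeta;
TORCH_CUDABLAS_CHECK(cublasGemmEx(
    handle, opa, opb, m, n, k,
    use_fp16_accum ? static_cast<const void*>(&halpha) : static_cast<const void*>(&falpha),
    a, CUDA_R_16F, lda,
    b, CUDA_R_16F, ldb,
    use_fp16_accum ? static_cast<const void*>(&hbeta) : static_cast<const void*>(&fbeta),
    c, CUDA_R_16F, ldc,
    use_fp16_accum ? CUBLAS_COMPUTE_16F : CUBLAS_COMPUTE_32F,
    CUBLAS_GEMM_DEFAULT_TENSOR_OP));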

Also, for batched GEMM operations, the cuBLAS compute type is set to CUBLAS_COMPUTE_32F unconditionally, without checking the allowFP16ReductionCuBLAS option:

cublasComputeType_t computeType = CUBLAS_COMPUTE_32F;
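A hypothetical sketch for the batched path (names illustrative): honor the flag in the FP16 specialization instead of hard-coding the compute type.

// Only applies to the at::Half path; other dtypes keep CUBLAS_COMPUTE_32F.
cublasComputeType_t computeType = at::globalContext().allowFP16ReductionCuBLAS()
    ? CUBLAS_COMPUTE_16F   // FP16 accumulation: faster on Ada-class GPUs, lower precision
    : CUBLAS_COMPUTE_32F;  // FP32 accumulation: current behavior
// The scale type (and therefore the alpha/beta arguments) would need to switch to
// CUDA_R_16F / half values accordingly, as in the non-batched sketch above.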

This feature can roughly DOUBLE FP16 GEMM throughput on commodity GPUs like the RTX 4090/4080. For users who explicitly enable torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction, we should pass CUBLAS_COMPUTE_16F as the cublasComputeType in cuBLAS function calls for higher performance.

If this is the desired behavior, I'm very willing to write a patch to support this feature :)

Alternatives

No response

Additional context

No response

cc @ptrblck @csarofeen @xwang233
