🚀 The feature, motivation and pitch
For NVIDIA Ada architecture GPUs such as the RTX 4090, FP16 GEMM throughput differs between FP16 accumulation (330.3 TFLOPS) and FP32 accumulation (165.2 TFLOPS); these figures can be checked against NVIDIA's documentation. The accumulation mode is controlled by the cublasComputeType argument passed to cuBLAS functions.
The current implementation in PyTorch has an option named torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction (allowFP16ReductionCuBLAS in the ATen context). As the documentation suggests, when it is set to true, FP16 accumulation is allowed for FP16 GEMM.
The check of this option is in:
pytorch/aten/src/ATen/cuda/CUDABlas.cpp
Lines 454 to 456 in 39901f2
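For reference, here is a minimal sketch of the kind of check being discussed, assuming the option is read from the global ATen context and mapped onto the cuBLAS math-mode flags. This is a hedged illustration, not the verbatim PyTorch source; the helper name is made up.

```cpp
#include <cublas_v2.h>
#include <ATen/Context.h>

// Hedged sketch (not the verbatim PyTorch source): map the ATen-level
// allowFP16ReductionCuBLAS option onto the cuBLAS math-mode flags.
static cublasMath_t fp16_gemm_math_flags() {
  cublasMath_t flags = CUBLAS_DEFAULT_MATH;
  if (!at::globalContext().allowFP16ReductionCuBLAS()) {
    // The user did not opt in, so forbid FP16 accumulation inside the kernel.
    flags = static_cast<cublasMath_t>(
        flags | CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION);
  }
  return flags;
}
```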
Even if the user enables the allowFP16ReductionCuBLAS option, the following call to cublasGemmEx will still use FP32 as the accumulation mode (the second-to-last argument is always CUDA_R_32F).
pytorch/aten/src/ATen/cuda/CUDABlas.cpp
Lines 460 to 479 in 39901f2
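For illustration, here is a minimal standalone cublasGemmEx call that requests FP16 accumulation through the computeType argument (the second-to-last parameter). This is a hedged sketch, not the PyTorch code path; the helper name and the column-major __half device pointers are assumptions.

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Sketch: C = A * B with FP16 inputs/outputs and FP16 accumulation.
// A, B, C are column-major __half device pointers (assumed).
cublasStatus_t gemm_fp16_acc(cublasHandle_t handle,
                             int m, int n, int k,
                             const __half* A, int lda,
                             const __half* B, int ldb,
                             __half* C, int ldc) {
  // With CUBLAS_COMPUTE_16F the alpha/beta scalars must be __half as well.
  const __half alpha = __float2half(1.0f);
  const __half beta  = __float2half(0.0f);
  return cublasGemmEx(handle,
                      CUBLAS_OP_N, CUBLAS_OP_N,
                      m, n, k,
                      &alpha,
                      A, CUDA_R_16F, lda,
                      B, CUDA_R_16F, ldb,
                      &beta,
                      C, CUDA_R_16F, ldc,
                      CUBLAS_COMPUTE_16F,   // FP16 accumulation instead of CUDA_R_32F
                      CUBLAS_GEMM_DEFAULT);
}
```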
Also, for the batched GEMM operation, the cuBLAS compute type is set to CUBLAS_COMPUTE_32F unconditionally, without checking the allowFP16ReductionCuBLAS option.
pytorch/aten/src/ATen/cuda/CUDABlas.cpp
Line 645 in 39901f2
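A hedged sketch of what the batched path could look like if the compute type (and the matching scalar types) were chosen from the option instead of being hard-coded. The function name and the boolean flag standing in for allowFP16ReductionCuBLAS are illustrative, not the actual patch.

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Sketch: strided-batched FP16 GEMM that picks the compute type from a flag
// standing in for allowFP16ReductionCuBLAS. Note that the alpha/beta scalar
// types must follow the compute type (__half for 16F, float for 32F).
cublasStatus_t bgemm_fp16(cublasHandle_t handle, bool allow_fp16_reduction,
                          int m, int n, int k,
                          const __half* A, int lda, long long strideA,
                          const __half* B, int ldb, long long strideB,
                          __half* C, int ldc, long long strideC,
                          int batchCount) {
  if (allow_fp16_reduction) {
    const __half alpha = __float2half(1.0f);
    const __half beta  = __float2half(0.0f);
    return cublasGemmStridedBatchedEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                      m, n, k, &alpha,
                                      A, CUDA_R_16F, lda, strideA,
                                      B, CUDA_R_16F, ldb, strideB,
                                      &beta,
                                      C, CUDA_R_16F, ldc, strideC,
                                      batchCount, CUBLAS_COMPUTE_16F,
                                      CUBLAS_GEMM_DEFAULT);
  } else {
    const float alpha = 1.0f;
    const float beta  = 0.0f;
    return cublasGemmStridedBatchedEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                      m, n, k, &alpha,
                                      A, CUDA_R_16F, lda, strideA,
                                      B, CUDA_R_16F, ldb, strideB,
                                      &beta,
                                      C, CUDA_R_16F, ldc, strideC,
                                      batchCount, CUBLAS_COMPUTE_32F,
                                      CUBLAS_GEMM_DEFAULT);
  }
}
```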
This feature can DOUBLE the performance on commodity GPUs like the 4090/4080. For users who explicitly enable torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction, we should use CUBLAS_COMPUTE_16F as the cublasComputeType in cuBLAS function calls for higher performance.
If this is the desired behavior, I'm very willing to write a patch to support this feature :)
Alternatives
No response
Additional context
No response