Add support for CUBLAS_TENSOR_OP_MATH in fp16 GEMM #13451
Conversation
Can one of the admins verify this patch?
For review by @zheng-xq.
Force-pushed from e2fd49e to 9492b54
Jenkins, test this please.
Jenkins, test this please. (running again in case this was infra)
}

bool CUDABlas::DoBlasAxpy(Stream *stream, uint64 elem_count, float alpha,
                          const DeviceMemory<float> &x, int incx,
                          DeviceMemory<float> *y, int incy) {
  return DoBlasInternal(wrap::cublasSaxpy, stream,
-                       true /* = pointer_mode_host */, elem_count, &alpha,
+                       true /* = pointer_mode_host */,
+                       false /* = use_tensor_ops */, elem_count, &alpha,
For most of the changes here, are they affected by the use_tensor_ops setting?
If not, I'd rather leave the majority alone and not set this bit at all. We can have a separate DoBlasInternalWithTensorOp(...) that does set use_tensor_ops.
Seconding XQ's comment. FYI, we did this with DoFusedConvolve() to support fused conv: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/stream_executor/cuda/cuda_dnn.cc#L2461
Done.
//
// Note that when false is returned, an appropriate error has already been
// logged.
bool Init(cublasMath_t new_mode) {
Why do we need this? With RAII, can we do this in the constructor?
The separate init function is used to better handle errors. This follows the existing approach used for ScopedCublasPointerMode.
ret = wrap::cublasSetMathMode(parent_, handle_, new_mode);
if (ret != CUBLAS_STATUS_SUCCESS) {
  LOG(ERROR) << "failed to set new cublas math mode: " << ToString(ret);
Add a TODO to propagate this error back to callers. TensorFlow wants to fail gracefully when this happens.
Propagation is already in place as the callers check for a false return value.
@@ -545,7 +622,13 @@ bool CUDABlas::DoBlasInternalImpl(FuncT cublas_func, Stream *stream,
                         : CUBLAS_POINTER_MODE_DEVICE)) {
    return false;
  }

#if CUDA_VERSION >= 9000
  ScopedCublasMathMode math_mode{parent_, blas_};
If the set of functions affected by this is small, I'd rather not call it at all in most of them.
OK. We'll only call init when tensor-ops are needed.
                            &cc_minor);

// GPUs < sm_70 don't support tensor cores
if (cc_major >= 7 && TensorOpMathEnabled()) {
I'm not a big fan of parsing an env-var on every matmul call. Either cache this globally, or have the caller cache it and pass it in.
The problem with a global setting is that you have to test it separately, since there is no way to unset it so that it doesn't affect other tests.
With TF, this can be cached at op construction time, which saves the trouble.
The env-var is only checked on first access. TensorOpMathEnabled() caches the return value in a static variable.
I think you'd have to also add this code to DoBlasGemmWithProfiling() and DoBlasGemmWithAlgorithm() for fp16, so that it can also be used when turning on autotune.
    CUDAMemory(a), cuda_in_type, lda, CUDAMemory(b), cuda_in_type, ldb, &beta,
    CUDAMemoryMutable(c), CUDADataType<OutT>::type, ldc,
    CUDAComputationType(computation_type),
    false /* = use_tensor_ops */, CUDABlasTranspose(transa),
This is the part where the matmul autotuning works (DoBlasGemmWithAlgorithmImpl). Should we make this part of the autotuning?
This feature is currently disabled by default through the env-var TF_MATMUL_AUTOTUNE_ENABLE, due to large noise. But our plan is to enable it, so it is a good idea to make sure it still works.
I'd suggest we add tensor_op support here, while including the autotune setting at TensorFlow's op level by adding a way to propagate the use_tensor_ops option to the matmul op.
Added cuda 9.0 algos, including tensor ops, to the algorithm list. Also, added checks to DoBlasGemmWithAlgorithmImpl to return false if tensor_op algorithm is requested when not appropriate. These changes enable the auto-tuning path to use tensor-ops.
Hi Nathan, were you able to make some changes to this pull request? After that I can give another pass of review. Thank you!
Force-pushed from 9492b54 to b6210e7
Just pushed.
Thank you for the changes! See my comments below.
                  bool pointer_mode_host, bool use_tensor_ops,
                  Args... args) {
  return DoBlasInternalImpl(cublas_func, stream, pointer_mode_host,
                            /*err_on_failure=*/true, /*use_tensor_ops=*/false,
The name DoBlasInternalWithTensorOps() suggests that use_tensor_ops should be set to true. Also, if we use this function, I think the argument use_tensor_ops is redundant.
Please also change the interfaces of DoBlasGemmWithProfiling() and DoBlasGemmWithAlgorithm() for fp16. Thank you!
Tensor op usage is encoded in the algorithm, so no modification of DoBlasGemmWithAlgorithm is needed.
DoBlasGemmWithProfiling wraps DoBlasGemm and also should not require further change as I understand the code.
Yes you're right, both functions call DoBlasGemm() so nothing should be done in those two functions. Thanks for the correction.
    CUDABlasTranspose(transa), CUDABlasTranspose(transb), m, n, k, &alpha,
    CUDAMemory(a), SE_CUDA_DATA_HALF, lda, CUDAMemory(b), SE_CUDA_DATA_HALF,
    ldb, &beta, CUDAMemoryMutable(c), SE_CUDA_DATA_HALF, ldc);
    use_tensor_ops, CUDABlasTranspose(transa), CUDABlasTranspose(transb), m,
DoBlasInternalWithTensorOps should set use_tensor_ops to true; I think we should get rid of the use_tensor_ops argument.
It is convenient to pass this as a flag, otherwise I'll have to do an if-else on the flag in order to make different function calls. Alternatively, I could remove DoBlasInternalWithTensorOps entirely and use direct calls to DoBlasInternalImpl. Would that be preferable?
I'm fine with either way. I noticed that later in DoBlasInternalWithTensorOps() you directly called this: return DoBlasInternalImpl(cublas_func, stream, pointer_mode_host, /*err_on_failure=*/true, /*use_tensor_ops=*/false, ...). Shouldn't it be /*use_tensor_ops=*/true?
Also I still think if you are using DoBlasInternalWithTensorOps(), it suggests you are using tensor ops, so probably just get rid of this function?
Yes, that was a bug. Thanks for the catch. DoBlasGemm for fp16 now calls DoBlasInternalImpl directly and DoBlasInternalWithTensorOps has been removed.
static bool TensorOpMathEnabled() {
  static bool is_enabled = [] {
    bool ret;
    TF_CHECK_OK(tensorflow::ReadBoolFromEnvVar("TF_DISABLE_TENSOR_OP_MATH",
I noticed that in cuda_dnn.cc, there is also an env-var with exactly the same name: TF_DISABLE_TENSOR_OP_MATH. Ideally, we would want each env-var to only control the behavior of one op, this will make debugging easier. Could you consider using two different names such as TF_DISABLE_CUBLAS_TENSOR_OP_MATH and TF_DISABLE_CUDNN_TENSOR_OP_MATH?
TF_DISABLE_TENSOR_OP_MATH continues to disable tensor ops in both cublas and cudnn, but I added TF_DISABLE_CUBLAS_TENSOR_OP_MATH and TF_DISABLE_CUDNN_TENSOR_OP_MATH which will override the above and apply specifically to CUBLAS or CUDNN.
Force-pushed from b6210e7 to 34852ec
Please also add a description noting that you have changed GetBlasGemmAlgorithms() to include more internal algorithms for CUDA_VERSION >= 9000.
  } else {
    TF_CHECK_OK(tensorflow::ReadBoolFromEnvVar("TF_DISABLE_TENSOR_OP_MATH",
                                               /*default=*/false, &ret));
  }
  return ret;
Should this also be !ret like in cuda_blas.cc?
Force-pushed from 34852ec to 3b00c95
Done.
LGTM.
@zheng-xq PTAL
Ping @zheng-xq. Are there additional changes needed?
A few more comments
@@ -299,6 +305,24 @@ static string ToString(cublasStatus_t status) {
  }
}

// Decide whether to enable TENSOR_OP_MATH
static bool TensorOpMathEnabled() {
  static bool is_enabled = [] {
What's the point of having a lambda and evaluating it right away?
This allows the static is_enabled to be set inline.
      tensorflow::ReadBoolFromEnvVar("TF_DISABLE_CUBLAS_TENSOR_OP_MATH",
                                     /*default=*/false, &is_disabled));
  } else {
    TF_CHECK_OK(tensorflow::ReadBoolFromEnvVar("TF_DISABLE_TENSOR_OP_MATH",
In general, I'm not a fan of hierarchical flags, since their relationship is not clearly documented. If the name is confusing, why not just change the other one into "TF_DISABLE_CUDNN_TENSOR_OP_MATH"?
Removed TF_DISABLE_TENSOR_OP_MATH.
@@ -2049,6 +2152,15 @@ bool CUDABlas::DoBlasGemmWithAlgorithmImpl(
    return false;
  }

#if CUDA_VERSION >= 9000
  cublasGemmAlgo_t cublas_algo = static_cast<cublasGemmAlgo_t>(algorithm);
  if (cublas_algo >= CUBLAS_GEMM_DFALT_TENSOR_OP &&
Why not use "CUBLAS_GEMM_DEFAULT_TENSOR_OP"?
Also, I'm not a fan of comparing enums. Could you split "cublas_algo >= CUBLAS_GEMM_DFALT_TENSOR_OP" into a function? In fact, I'd prefer to return them one by one, so that future changes have to update them.
Changed to CUBLAS_GEMM_DEFAULT_TENSOR_OP and moved inequality check to a separate function.
      CUBLAS_GEMM_ALGO2, CUBLAS_GEMM_ALGO3, CUBLAS_GEMM_ALGO4,
      CUBLAS_GEMM_ALGO5, CUBLAS_GEMM_ALGO6, CUBLAS_GEMM_ALGO7}) {
  for (cublasGemmAlgo_t algo : {
      CUBLAS_GEMM_DFALT, CUBLAS_GEMM_ALGO0, CUBLAS_GEMM_ALGO1,
Could you split the list into its own function? So we know to update them, or even auto-gen them from the header file.
I'm not sure what is being suggested here. The current function seems pretty clean to me. Do you envision separate functions for CUBLAS 8 vs. 9 algos, or separate functions for tensor-op vs. non-tensor-op algos? Either of these would make the API higher up the stack more complicated.
Force-pushed from fc65556 to 776842c
- Applies to matrix multiplications with fp16 input/output. Computations will fall back to pseudo-fp16 if tensor op math is disabled or not supported.
- Enabled by default. Tensor ops (both in cublas gemms and cudnn convolutions) can be disabled globally by setting the environment variable TF_DISABLE_TENSOR_OP_MATH=1. To disable tensor ops specifically for gemms or convolutions use TF_DISABLE_CUBLAS_TENSOR_OP_MATH=1 or TF_DISABLE_CUDNN_TENSOR_OP_MATH=1, respectively.
- Added CUBLAS 9.0 algorithms to GetBlasGemmAlgorithms().
Force-pushed from 776842c to 80d3e85
This looks okay for now. Let's merge it in and improve it later. Thanks for the contribution!
LGTM
Jenkins, test this please.