
Add support for CUBLAS_TENSOR_OP_MATH in fp16 GEMM #13451

Merged: 1 commit from gemm-tensor-core merged into tensorflow:master on Dec 26, 2017

Conversation

@nluehr (Contributor) commented on Oct 2, 2017:

  • Applies to matrix multiplications with fp16 input/output.
    Computations will fall back to pseudo-fp16 if tensor op math is
    disabled or not supported.
  • Enabled by default, but can be disabled by setting the environment
    variable TF_DISABLE_TENSOR_OP_MATH=1. (A rough sketch of the selection
    logic follows below.)
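
As a rough sketch of that selection logic (illustrative, not the verbatim patch; it assumes stream_executor's DeviceDescription::cuda_compute_capability() query and the TensorOpMathEnabled() helper introduced by this change), an fp16 DoBlasGemm call would choose the math mode like this:

#if CUDA_VERSION >= 9000
  // Tensor ops need CUDA 9 and a Volta-class GPU (sm_70 or newer); the
  // env-var switch can force them off.
  bool use_tensor_ops = false;
  int cc_major = 0, cc_minor = 0;
  if (stream->parent()->GetDeviceDescription().cuda_compute_capability(
          &cc_major, &cc_minor)) {
    use_tensor_ops = (cc_major >= 7) && TensorOpMathEnabled();
  }
#endif
  // If use_tensor_ops stays false, the GEMM runs in the default cuBLAS math
  // mode, i.e. the pseudo-fp16 fallback described above.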

@tensorflow-jenkins (Collaborator) commented:

Can one of the admins verify this patch?

@nluehr (Contributor, Author) commented on Oct 2, 2017:

For review by @zheng-xq.

@frankchn requested a review from zheng-xq on October 2, 2017
@frankchn added the awaiting review label on Oct 2, 2017
@frankchn (Contributor) commented on Oct 2, 2017:

Jenkins, test this please.

@tfboyd requested a review from reedwm on October 6, 2017
@martinwicke (Member) commented:
Jenkins, test this please. (running again in case this was infra)

@martinwicke added the kokoro:run label and removed the awaiting review label on Nov 7, 2017
@martinwicke added the awaiting review and stat:awaiting tensorflower labels and removed the awaiting testing (then merge) label on Nov 7, 2017
bool CUDABlas::DoBlasAxpy(Stream *stream, uint64 elem_count, float alpha,
                          const DeviceMemory<float> &x, int incx,
                          DeviceMemory<float> *y, int incy) {
  return DoBlasInternal(wrap::cublasSaxpy, stream,
-                       true /* = pointer_mode_host */, elem_count, &alpha,
+                       true /* = pointer_mode_host */,
+                       false /* = use_tensor_ops */, elem_count, &alpha,
@zheng-xq (Contributor) commented:

Are most of the changes here actually affected by the use_tensor_ops setting?

If not, I'd rather leave the majority alone and not set this bit at all. We could have a separate DoBlasInternalWithTensorOp(....) that sets use_tensor_ops.

Another reviewer commented:

Second on XQ's comment. FYI, we have done this with DoFusedConvolve() to support fused conv: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/stream_executor/cuda/cuda_dnn.cc#L2461

@nluehr (Contributor, Author) replied:

Done.

//
// Note that when false is returned, an appropriate error has already been
// logged.
bool Init(cublasMath_t new_mode) {
@zheng-xq (Contributor) commented:

Why do we need this? With RAII, can we do this in the constructor?

@nluehr (Contributor, Author) replied:

The separate init function is used to better handle errors. This follows the existing approach used for ScopedCublasPointerMode.
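
For readers following along, a sketch of the guard under discussion, modeled on the existing ScopedCublasPointerMode (names and details are illustrative, not the verbatim patch):

class ScopedCublasMathMode {
 public:
  // The constructor only records the handle; Init() does the fallible work.
  ScopedCublasMathMode(CUDAExecutor *parent, cublasHandle_t handle)
      : parent_(parent), handle_(handle), ok_(false) {}

  // Saves the current math mode and switches to new_mode. Returns false (and
  // logs) on failure so the caller can skip the optimization instead of
  // crashing.
  bool Init(cublasMath_t new_mode) {
    cublasStatus_t ret = wrap::cublasGetMathMode(parent_, handle_, &old_mode_);
    if (ret != CUBLAS_STATUS_SUCCESS) {
      LOG(ERROR) << "failed to get old cublas math mode: " << ToString(ret);
      return false;
    }
    ret = wrap::cublasSetMathMode(parent_, handle_, new_mode);
    if (ret != CUBLAS_STATUS_SUCCESS) {
      LOG(ERROR) << "failed to set new cublas math mode: " << ToString(ret);
      return false;
    }
    ok_ = true;
    return true;
  }

  // Restores the previous math mode only if Init() succeeded.
  ~ScopedCublasMathMode() {
    if (ok_) {
      wrap::cublasSetMathMode(parent_, handle_, old_mode_);
    }
  }

 private:
  CUDAExecutor *parent_;
  cublasHandle_t handle_;
  cublasMath_t old_mode_;
  bool ok_;
};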


ret = wrap::cublasSetMathMode(parent_, handle_, new_mode);
if (ret != CUBLAS_STATUS_SUCCESS) {
LOG(ERROR) << "failed to set new cublas math mode: " << ToString(ret);
@zheng-xq (Contributor) commented:

Add a TODO to propagate this error back to callers. TensorFlow wants to fail gracefully when this happens.

@nluehr (Contributor, Author) replied:

Propagation is already in place as the callers check for a false return value.

@@ -545,7 +622,13 @@ bool CUDABlas::DoBlasInternalImpl(FuncT cublas_func, Stream *stream,
: CUBLAS_POINTER_MODE_DEVICE)) {
return false;
}

#if CUDA_VERSION >= 9000
ScopedCublasMathMode math_mode{parent_, blas_};
@zheng-xq (Contributor) commented:

If the set of functions affected by this is small, I'd rather not call it at all in most of them.

@nluehr (Contributor, Author) replied:

OK. We'll only call init when tensor-ops are needed.

&cc_minor);

// GPUs < sm_70 don't support tensor cores
if (cc_major >= 7 && TensorOpMathEnabled()) {
@zheng-xq (Contributor) commented:

I'm not a big fan of parsing an env-var on every matmul call. Either cache this globally, or have the caller cache it and pass it in.

The problem with a global setting is that it has to be tested separately, since there is no way to unset it without affecting other tests.

In TF, this can be cached at op-construction time, which avoids the problem.

@nluehr (Contributor, Author) replied:

The env-var is only checked on first access. TensorOpMathEnabled() caches the return value in a static variable.
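
A sketch of that caching pattern (assuming tensorflow::ReadBoolFromEnvVar from tensorflow/core/util/env_var.h; the env-var name matches this revision of the patch):

// Decide whether to enable TENSOR_OP_MATH. The env-var is read only once, the
// first time this function runs; the result is cached in the function-local
// static, so subsequent GEMM calls pay nothing.
static bool TensorOpMathEnabled() {
  static bool is_enabled = [] {
    bool is_disabled = false;
    TF_CHECK_OK(tensorflow::ReadBoolFromEnvVar("TF_DISABLE_TENSOR_OP_MATH",
                                               /*default=*/false,
                                               &is_disabled));
    return !is_disabled;
  }();
  return is_enabled;
}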

A reviewer commented:

I think you'd have to also add this code to DoBlasGemmWithProfiling() and DoBlasGemmWithAlgorithm() for fp16, so that it can also be used when turning on autotune.

CUDAMemory(a), cuda_in_type, lda, CUDAMemory(b), cuda_in_type, ldb, &beta,
CUDAMemoryMutable(c), CUDADataType<OutT>::type, ldc,
CUDAComputationType(computation_type),
false /* = use_tensor_ops */, CUDABlasTranspose(transa),
@zheng-xq (Contributor) commented:

This is the part where matmul autotuning happens (DoBlasGemmWithAlgorithmImpl). Should we make this change part of the autotuning?

The autotuning feature is currently disabled by default through the env-var TF_MATMUL_AUTOTUNE_ENABLE, due to large noise, but our plan is to enable it. So it is a good idea to make sure this still works with it.

Another reviewer commented:

I'd suggest adding tensor_op support here, and including the setting in autotune at TensorFlow's op level by adding a way to propagate the use_tensor_ops option to the matmul op.

@nluehr (Contributor, Author) replied:

Added CUDA 9.0 algorithms, including tensor-op algorithms, to the algorithm list. Also added checks in DoBlasGemmWithAlgorithmImpl to return false if a tensor-op algorithm is requested when it is not appropriate. These changes enable the auto-tuning path to use tensor ops.

@zheng-xq requested a review from yzhwang and removed the request for reedwm on November 10, 2017
@yzhwang left a comment:

Hi Nathan, were you able to make some changes to this pull request? After that I can give another pass of review. Thank you!

@nluehr (Contributor, Author) commented on Nov 21, 2017:

Just pushed.

@yzhwang left a comment:

Thank you for the changes! See my comments below.

bool pointer_mode_host, bool use_tensor_ops,
Args... args) {
return DoBlasInternalImpl(cublas_func, stream, pointer_mode_host,
/*err_on_failure=*/true, /*use_tensor_ops=*/false,
@yzhwang commented:

The name DoBlasInternalWithTensorOps() suggests that use_tensor_ops should be set to true. Also, if we use this function, I think the argument use_tensor_ops is redundant.

@yzhwang commented:

Please also change the interfaces of DoBlasGemmWithProfiling() and DoBlasGemmWithAlgorithm() for fp16. Thank you!

@nluehr (Contributor, Author) replied:

Tensor op usage is encoded in the algorithm, so no modification of DoBlasGemmWithAlgorithm is needed. DoBlasGemmWithProfiling wraps DoBlasGemm and, as I understand the code, also should not require further changes.

@yzhwang replied:

Yes, you're right: both functions call DoBlasGemm(), so nothing needs to be done in those two. Thanks for the correction.

-      CUDABlasTranspose(transa), CUDABlasTranspose(transb), m, n, k, &alpha,
-      CUDAMemory(a), SE_CUDA_DATA_HALF, lda, CUDAMemory(b), SE_CUDA_DATA_HALF,
-      ldb, &beta, CUDAMemoryMutable(c), SE_CUDA_DATA_HALF, ldc);
+      use_tensor_ops, CUDABlasTranspose(transa), CUDABlasTranspose(transb), m,
@yzhwang commented:

DoBlasInternalWithTensorOps() should set use_tensor_ops to true, so I think we should get rid of the use_tensor_ops argument.

@nluehr (Contributor, Author) replied:

It is convenient to pass this as a flag, otherwise I'll have to do an if-else on the flag in order to make different function calls. Alternatively, I could remove DoBlasInternalWithTensorOps entirely and use direct calls to DoBlasInternalImpl. Would that be preferable?

@yzhwang replied:

I'm fine with either way. I noticed that later, in DoBlasInternalWithTensorOps(), you directly call: return DoBlasInternalImpl(cublas_func, stream, pointer_mode_host, /*err_on_failure=*/true, /*use_tensor_ops=*/false, ...). Shouldn't that be /*use_tensor_ops=*/true?

Also, I still think that if you are using DoBlasInternalWithTensorOps(), it suggests you are using tensor ops, so probably just get rid of this function?

@nluehr (Contributor, Author) replied:

Yes, that was a bug. Thanks for the catch. DoBlasGemm for fp16 now calls DoBlasInternalImpl directly and DoBlasInternalWithTensorOps has been removed.
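
A sketch of what that final shape looks like, stitched together from the snippets above (argument order is illustrative, not the verbatim patch):

  // fp16 DoBlasGemm: no wrapper, the computed flag goes straight to the impl.
  return DoBlasInternalImpl(
      wrap::cublasSgemmEx, stream, true /* = pointer_mode_host */,
      true /* = err_on_failure */, use_tensor_ops, CUDABlasTranspose(transa),
      CUDABlasTranspose(transb), m, n, k, &alpha, CUDAMemory(a),
      SE_CUDA_DATA_HALF, lda, CUDAMemory(b), SE_CUDA_DATA_HALF, ldb, &beta,
      CUDAMemoryMutable(c), SE_CUDA_DATA_HALF, ldc);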

static bool TensorOpMathEnabled() {
static bool is_enabled = [] {
bool ret;
TF_CHECK_OK(tensorflow::ReadBoolFromEnvVar("TF_DISABLE_TENSOR_OP_MATH",
@yzhwang commented:

I noticed that in cuda_dnn.cc there is also an env-var with exactly the same name: TF_DISABLE_TENSOR_OP_MATH. Ideally, we would want each env-var to control the behavior of only one op; that makes debugging easier. Could you consider using two different names, such as TF_DISABLE_CUBLAS_TENSOR_OP_MATH and TF_DISABLE_CUDNN_TENSOR_OP_MATH?

@nluehr (Contributor, Author) replied:

TF_DISABLE_TENSOR_OP_MATH continues to disable tensor ops in both cuBLAS and cuDNN, but I added TF_DISABLE_CUBLAS_TENSOR_OP_MATH and TF_DISABLE_CUDNN_TENSOR_OP_MATH, which override it and apply specifically to cuBLAS or cuDNN.


@yzhwang left a comment:

Please also add to the PR description that you have changed GetBlasGemmAlgorithms() to include more internal algorithms for CUDA_VERSION >= 9000.

} else {
TF_CHECK_OK(tensorflow::ReadBoolFromEnvVar("TF_DISABLE_TENSOR_OP_MATH",
/*default=*/false, &ret));
}
return ret;
@yzhwang commented:

Should this also be !ret like in cuda_blas.cc?

@nluehr (Contributor, Author) commented on Nov 22, 2017:

Done.

@yzhwang left a comment:

LGTM.

@sb2nov added the kokoro:force-run label on Nov 29, 2017
@kokoro-team removed the kokoro:force-run label on Nov 29, 2017
@sb2nov (Contributor) commented on Nov 29, 2017:

@zheng-xq PTAL

@nluehr (Contributor, Author) commented on Dec 5, 2017:

Ping @zheng-xq. Are there additional changes needed?

@zheng-xq (Contributor) left a comment:

A few more comments

@@ -299,6 +305,24 @@ static string ToString(cublasStatus_t status) {
}
}

// Decide whether to enable TENSOR_OP_MATH
static bool TensorOpMathEnabled() {
static bool is_enabled = [] {
@zheng-xq (Contributor) commented:

What's the point of having a lambda and evaluating it right away?

@nluehr (Contributor, Author) replied:

This allows the static is_enabled to be set inline.

tensorflow::ReadBoolFromEnvVar("TF_DISABLE_CUBLAS_TENSOR_OP_MATH",
/*default=*/false, &is_disabled));
} else {
TF_CHECK_OK(tensorflow::ReadBoolFromEnvVar("TF_DISABLE_TENSOR_OP_MATH",
@zheng-xq (Contributor) commented:

In general, I'm not a fan of hierarchical flags, since their relationship is not clearly documented. If the name is confusing, why not just change the other one to "TF_DISABLE_CUDNN_TENSOR_OP_MATH"?

@nluehr (Contributor, Author) replied:

Removed TF_DISABLE_TENSOR_OP_MATH.

@@ -2049,6 +2152,15 @@ bool CUDABlas::DoBlasGemmWithAlgorithmImpl(
return false;
}

#if CUDA_VERSION >= 9000
cublasGemmAlgo_t cublas_algo = static_cast<cublasGemmAlgo_t>(algorithm);
if (cublas_algo >= CUBLAS_GEMM_DFALT_TENSOR_OP &&
@zheng-xq (Contributor) commented:

Why not use CUBLAS_GEMM_DEFAULT_TENSOR_OP?

Also, I'm not a fan of comparing enums. Could you split the "cublas_algo >= CUBLAS_GEMM_DFALT_TENSOR_OP" check into its own function? In fact, I'd prefer to list the tensor-op algorithms one by one, so future changes have to update them.

@nluehr (Contributor, Author) replied:

Changed to CUBLAS_GEMM_DEFAULT_TENSOR_OP and moved inequality check to a separate function.
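
Roughly, that refactor looks like the following sketch (the helper name and the exact guard condition are illustrative):

#if CUDA_VERSION >= 9000
// True if the given cuBLAS GEMM algorithm runs on tensor cores.
static bool UsesTensorOps(cublasGemmAlgo_t algo) {
  return algo >= CUBLAS_GEMM_DEFAULT_TENSOR_OP;
}
#endif

DoBlasGemmWithAlgorithmImpl() can then reject a tensor-op algorithm whenever tensor ops are disabled (the actual patch also checks that the inputs are fp16):

#if CUDA_VERSION >= 9000
  cublasGemmAlgo_t cublas_algo = static_cast<cublasGemmAlgo_t>(algorithm);
  if (UsesTensorOps(cublas_algo) && !TensorOpMathEnabled()) {
    return false;
  }
#endif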

for (cublasGemmAlgo_t algo : {
     CUBLAS_GEMM_DFALT, CUBLAS_GEMM_ALGO0, CUBLAS_GEMM_ALGO1,
     CUBLAS_GEMM_ALGO2, CUBLAS_GEMM_ALGO3, CUBLAS_GEMM_ALGO4,
     CUBLAS_GEMM_ALGO5, CUBLAS_GEMM_ALGO6, CUBLAS_GEMM_ALGO7}) {
@zheng-xq (Contributor) commented:

Could you split the list into its own function? That way we know to update it, or could even auto-generate it from the header file.

@nluehr (Contributor, Author) replied:

I'm not sure what is being suggested here. The current function seems pretty clean to me. Do you envision separate functions for CUBLAS 8 vs. 9 algorithms, or a separate function for tensor-op vs. non-tensor-op algorithms? Either of these would make the API higher up the stack more complicated.

@nluehr force-pushed the gemm-tensor-core branch 2 times, most recently from fc65556 to 776842c, on December 13, 2017
- Applies to matrix multiplications with fp16 input/output.
  Computations will fall back to pseudo-fp16 if tensor op math is
  disabled or not supported.
- Enabled by default. Tensor ops (both in cublas gemms and cudnn
  convolutions) can be disabled globally by setting the
  environment variable TF_DISABLE_TENSOR_OP_MATH=1. To disable
  tensor ops specifically for gemms or convolutions use
  TF_DISABLE_CUBLAS_TENSOR_OP_MATH=1 or
  TF_DISABLE_CUDNN_TENSOR_OP_MATH=1, respectively.
- Added CUBLAS 9.0 algorithms to GetBlasGemmAlgorithms().
@zheng-xq (Contributor) commented:

This looks okay for now. Let's merge it in and improve it later. Thanks for the contribution!

@zheng-xq (Contributor) commented:

LGTM

@martinwicke added the awaiting testing (then merge) label and removed the awaiting review and stat:awaiting tensorflower labels on Dec 23, 2017
@drpngx (Contributor) commented on Dec 25, 2017:

Jenkins, test this please.

@drpngx added the kokoro:force-run label on Dec 25, 2017
@kokoro-team removed the kokoro:force-run label on Dec 25, 2017
@drpngx merged commit bffa3e1 into tensorflow:master on Dec 26, 2017
angerson pushed a commit to angerson/tensorflow that referenced this pull request Jan 11, 2018
av8ramit pushed a commit that referenced this pull request Jan 12, 2018
@nluehr deleted the gemm-tensor-core branch on January 13, 2021