
Add support for CUBLAS_TENSOR_OP_MATH in fp16 GEMM #13451

Merged: 1 commit from gemm-tensor-core merged into tensorflow:master on Dec 26, 2017

Conversation

@nluehr (Contributor) commented on Oct 2, 2017:

  • Applies to matrix multiplications with fp16 input/output.
    Computations will fall back to pseudo-fp16 if tensor op math is
    disabled or not supported.
  • Enabled by default, but can be disabled by setting the environment
    variable TF_DISABLE_TENSOR_OP_MATH=1. (A rough sketch of the selection
    logic follows below.)
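
As a rough sketch of that selection logic (illustrative, not the verbatim patch; it assumes stream_executor's DeviceDescription::cuda_compute_capability() query and the TensorOpMathEnabled() helper introduced by this change), an fp16 DoBlasGemm call would choose the math mode like this:

#if CUDA_VERSION >= 9000
  // Tensor ops need CUDA 9 and a Volta-class GPU (sm_70 or newer); the
  // env-var switch can force them off.
  bool use_tensor_ops = false;
  int cc_major = 0, cc_minor = 0;
  if (stream->parent()->GetDeviceDescription().cuda_compute_capability(
          &cc_major, &cc_minor)) {
    use_tensor_ops = (cc_major >= 7) && TensorOpMathEnabled();
  }
#endif
  // If use_tensor_ops stays false, the GEMM runs in the default cuBLAS math
  // mode, i.e. the pseudo-fp16 fallback described above.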

@tensorflow-jenkins (Collaborator) commented:

Can one of the admins verify this patch?

@nluehr (Contributor, Author) commented on Oct 2, 2017:

For review by @zheng-xq.

@frankchn requested a review from zheng-xq on October 2, 2017
@frankchn added the awaiting review label on Oct 2, 2017
@frankchn (Contributor) commented on Oct 2, 2017:

Jenkins, test this please.

@tfboyd requested a review from reedwm on October 6, 2017
@martinwicke (Member) commented:
Jenkins, test this please. (running again in case this was infra)

@martinwicke added the kokoro:run label and removed the awaiting review label on Nov 7, 2017
@martinwicke added the awaiting review and stat:awaiting tensorflower labels and removed the awaiting testing (then merge) label on Nov 7, 2017
bool CUDABlas::DoBlasAxpy(Stream *stream, uint64 elem_count, float alpha,
                          const DeviceMemory<float> &x, int incx,
                          DeviceMemory<float> *y, int incy) {
  return DoBlasInternal(wrap::cublasSaxpy, stream,
-                       true /* = pointer_mode_host */, elem_count, &alpha,
+                       true /* = pointer_mode_host */,
+                       false /* = use_tensor_ops */, elem_count, &alpha,
@zheng-xq (Contributor) commented:

Are most of the changes here actually affected by the use_tensor_ops setting?

If not, I'd rather leave the majority alone and not set this bit at all. We could have a separate DoBlasInternalWithTensorOp(....) that sets use_tensor_ops.

Another reviewer commented:

Second on XQ's comment. FYI, we have done this with DoFusedConvolve() to support fused conv: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/stream_executor/cuda/cuda_dnn.cc#L2461

@nluehr (Contributor, Author) replied:

Done.

//
// Note that when false is returned, an appropriate error has already been
// logged.
bool Init(cublasMath_t new_mode) {
@zheng-xq (Contributor) commented:

Why do we need this? With RAII, can we do this in the constructor?

@nluehr (Contributor, Author) replied:

The separate init function is used to better handle errors. This follows the existing approach used for ScopedCublasPointerMode.
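
For readers following along, a sketch of the guard under discussion, modeled on the existing ScopedCublasPointerMode (names and details are illustrative, not the verbatim patch):

class ScopedCublasMathMode {
 public:
  // The constructor only records the handle; Init() does the fallible work.
  ScopedCublasMathMode(CUDAExecutor *parent, cublasHandle_t handle)
      : parent_(parent), handle_(handle), ok_(false) {}

  // Saves the current math mode and switches to new_mode. Returns false (and
  // logs) on failure so the caller can skip the optimization instead of
  // crashing.
  bool Init(cublasMath_t new_mode) {
    cublasStatus_t ret = wrap::cublasGetMathMode(parent_, handle_, &old_mode_);
    if (ret != CUBLAS_STATUS_SUCCESS) {
      LOG(ERROR) << "failed to get old cublas math mode: " << ToString(ret);
      return false;
    }
    ret = wrap::cublasSetMathMode(parent_, handle_, new_mode);
    if (ret != CUBLAS_STATUS_SUCCESS) {
      LOG(ERROR) << "failed to set new cublas math mode: " << ToString(ret);
      return false;
    }
    ok_ = true;
    return true;
  }

  // Restores the previous math mode only if Init() succeeded.
  ~ScopedCublasMathMode() {
    if (ok_) {
      wrap::cublasSetMathMode(parent_, handle_, old_mode_);
    }
  }

 private:
  CUDAExecutor *parent_;
  cublasHandle_t handle_;
  cublasMath_t old_mode_;
  bool ok_;
};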


ret = wrap::cublasSetMathMode(parent_, handle_, new_mode);
if (ret != CUBLAS_STATUS_SUCCESS) {
LOG(ERROR) << "failed to set new cublas math mode: " << ToString(ret);
@zheng-xq (Contributor) commented:

Add a TODO to propagate this error back to callers. TensorFlow wants to fail gracefully when this happens.

@nluehr (Contributor, Author) replied:

Propagation is already in place as the callers check for a false return value.

@@ -545,7 +622,13 @@ bool CUDABlas::DoBlasInternalImpl(FuncT cublas_func, Stream *stream,
: CUBLAS_POINTER_MODE_DEVICE)) {
return false;
}

#if CUDA_VERSION >= 9000
ScopedCublasMathMode math_mode{parent_, blas_};
@zheng-xq (Contributor) commented:

If the set of functions affected by this is small, I'd rather not call it at all in most of them.

@nluehr (Contributor, Author) replied:

OK. We'll only call init when tensor-ops are needed.

&cc_minor);

// GPUs < sm_70 don't support tensor cores
if (cc_major >= 7 && TensorOpMathEnabled()) {
@zheng-xq (Contributor) commented:

I'm not a big fan of parsing an env-var on every matmul call. Either cache this globally, or have the caller cache it and pass it in.

The problem with a global setting is that it has to be tested separately, since there is no way to unset it without affecting other tests.

In TF, this can be cached at op-construction time, which avoids the problem.

@nluehr (Contributor, Author) replied:

The env-var is only checked on first access. TensorOpMathEnabled() caches the return value in a static variable.
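
A sketch of that caching pattern (assuming tensorflow::ReadBoolFromEnvVar from tensorflow/core/util/env_var.h; the env-var name matches this revision of the patch):

// Decide whether to enable TENSOR_OP_MATH. The env-var is read only once, the
// first time this function runs; the result is cached in the function-local
// static, so subsequent GEMM calls pay nothing.
static bool TensorOpMathEnabled() {
  static bool is_enabled = [] {
    bool is_disabled = false;
    TF_CHECK_OK(tensorflow::ReadBoolFromEnvVar("TF_DISABLE_TENSOR_OP_MATH",
                                               /*default=*/false,
                                               &is_disabled));
    return !is_disabled;
  }();
  return is_enabled;
}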

A reviewer commented:

I think you'd have to also add this code to DoBlasGemmWithProfiling() and DoBlasGemmWithAlgorithm() for fp16, so that it can also be used when turning on autotune.

CUDAMemory(a), cuda_in_type, lda, CUDAMemory(b), cuda_in_type, ldb, &beta,
CUDAMemoryMutable(c), CUDADataType<OutT>::type, ldc,
CUDAComputationType(computation_type),
false /* = use_tensor_ops */, CUDABlasTranspose(transa),
@zheng-xq (Contributor) commented:

This is the part where matmul autotuning happens (DoBlasGemmWithAlgorithmImpl). Should we make this change part of the autotuning?

The autotuning feature is currently disabled by default through the env-var TF_MATMUL_AUTOTUNE_ENABLE, due to large noise, but our plan is to enable it. So it is a good idea to make sure this still works with it.

Another reviewer commented:

I'd suggest adding tensor_op support here, and including the setting in autotune at TensorFlow's op level by adding a way to propagate the use_tensor_ops option to the matmul op.

@nluehr (Contributor, Author) replied:

Added CUDA 9.0 algorithms, including tensor-op algorithms, to the algorithm list. Also added checks in DoBlasGemmWithAlgorithmImpl to return false if a tensor-op algorithm is requested when it is not appropriate. These changes enable the auto-tuning path to use tensor ops.

@zheng-xq requested a review from yzhwang and removed the request for reedwm on November 10, 2017
@yzhwang left a comment:

Hi Nathan, were you able to make some changes to this pull request? After that I can give another pass of review. Thank you!

@nluehr (Contributor, Author) commented on Nov 21, 2017:

Just pushed.

@yzhwang left a comment:

Thank you for the changes! See my comments below.

bool pointer_mode_host, bool use_tensor_ops,
Args... args) {
return DoBlasInternalImpl(cublas_func, stream, pointer_mode_host,
/*err_on_failure=*/true, /*use_tensor_ops=*/false,
@yzhwang commented:

The name DoBlasInternalWithTensorOps() suggests that use_tensor_ops should be set to true. Also, if we use this function, I think the argument use_tensor_ops is redundant.

@yzhwang commented:

Please also change the interfaces of DoBlasGemmWithProfiling() and DoBlasGemmWithAlgorithm() for fp16. Thank you!

@nluehr (Contributor, Author) replied:

Tensor op usage is encoded in the algorithm, so no modification of DoBlasGemmWithAlgorithm is needed. DoBlasGemmWithProfiling wraps DoBlasGemm and, as I understand the code, also should not require further changes.

@yzhwang replied:

Yes, you're right: both functions call DoBlasGemm(), so nothing needs to be done in those two. Thanks for the correction.

-      CUDABlasTranspose(transa), CUDABlasTranspose(transb), m, n, k, &alpha,
-      CUDAMemory(a), SE_CUDA_DATA_HALF, lda, CUDAMemory(b), SE_CUDA_DATA_HALF,
-      ldb, &beta, CUDAMemoryMutable(c), SE_CUDA_DATA_HALF, ldc);
+      use_tensor_ops, CUDABlasTranspose(transa), CUDABlasTranspose(transb), m,
@yzhwang commented:

DoBlasInternalWithTensorOps() should set use_tensor_ops to true, so I think we should get rid of the use_tensor_ops argument.

@nluehr (Contributor, Author) replied:

It is convenient to pass this as a flag, otherwise I'll have to do an if-else on the flag in order to make different function calls. Alternatively, I could remove DoBlasInternalWithTensorOps entirely and use direct calls to DoBlasInternalImpl. Would that be preferable?

@yzhwang replied:

I'm fine with either way. I noticed that later, in DoBlasInternalWithTensorOps(), you directly call: return DoBlasInternalImpl(cublas_func, stream, pointer_mode_host, /*err_on_failure=*/true, /*use_tensor_ops=*/false, ...). Shouldn't that be /*use_tensor_ops=*/true?

Also, I still think that if you are using DoBlasInternalWithTensorOps(), it suggests you are using tensor ops, so probably just get rid of this function?

@nluehr (Contributor, Author) replied:

Yes, that was a bug. Thanks for the catch. DoBlasGemm for fp16 now calls DoBlasInternalImpl directly and DoBlasInternalWithTensorOps has been removed.
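
A sketch of what that final shape looks like, stitched together from the snippets above (argument order is illustrative, not the verbatim patch):

  // fp16 DoBlasGemm: no wrapper, the computed flag goes straight to the impl.
  return DoBlasInternalImpl(
      wrap::cublasSgemmEx, stream, true /* = pointer_mode_host */,
      true /* = err_on_failure */, use_tensor_ops, CUDABlasTranspose(transa),
      CUDABlasTranspose(transb), m, n, k, &alpha, CUDAMemory(a),
      SE_CUDA_DATA_HALF, lda, CUDAMemory(b), SE_CUDA_DATA_HALF, ldb, &beta,
      CUDAMemoryMutable(c), SE_CUDA_DATA_HALF, ldc);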

static bool TensorOpMathEnabled() {
static bool is_enabled = [] {
bool ret;
TF_CHECK_OK(tensorflow::ReadBoolFromEnvVar("TF_DISABLE_TENSOR_OP_MATH",
@yzhwang commented:

I noticed that in cuda_dnn.cc there is also an env-var with exactly the same name: TF_DISABLE_TENSOR_OP_MATH. Ideally, we would want each env-var to control the behavior of only one op; that makes debugging easier. Could you consider using two different names, such as TF_DISABLE_CUBLAS_TENSOR_OP_MATH and TF_DISABLE_CUDNN_TENSOR_OP_MATH?

@nluehr (Contributor, Author) replied:

TF_DISABLE_TENSOR_OP_MATH continues to disable tensor ops in both cuBLAS and cuDNN, but I added TF_DISABLE_CUBLAS_TENSOR_OP_MATH and TF_DISABLE_CUDNN_TENSOR_OP_MATH, which override it and apply specifically to cuBLAS or cuDNN.


@yzhwang left a comment:

Please also add to the PR description that you have changed GetBlasGemmAlgorithms() to include more internal algorithms for CUDA_VERSION >= 9000.

} else {
TF_CHECK_OK(tensorflow::ReadBoolFromEnvVar("TF_DISABLE_TENSOR_OP_MATH",
/*default=*/false, &ret));
}
return ret;
@yzhwang commented:

Should this also be !ret like in cuda_blas.cc?

@nluehr (Contributor, Author) commented on Nov 22, 2017:

Done.

@yzhwang left a comment:

LGTM.

@sb2nov added the kokoro:force-run label on Nov 29, 2017
@kokoro-team removed the kokoro:force-run label on Nov 29, 2017
@sb2nov (Contributor) commented on Nov 29, 2017:

@zheng-xq PTAL

@nluehr (Contributor, Author) commented on Dec 5, 2017:

Ping @zheng-xq. Are there additional changes needed?

@zheng-xq (Contributor) left a comment:

A few more comments

@@ -299,6 +305,24 @@ static string ToString(cublasStatus_t status) {
}
}

// Decide whether to enable TENSOR_OP_MATH
static bool TensorOpMathEnabled() {
static bool is_enabled = [] {
@zheng-xq (Contributor) commented:

What's the point of having a lambda and evaluating it right away?

@nluehr (Contributor, Author) replied:

This allows the static is_enabled to be set inline.

tensorflow::ReadBoolFromEnvVar("TF_DISABLE_CUBLAS_TENSOR_OP_MATH",
/*default=*/false, &is_disabled));
} else {
TF_CHECK_OK(tensorflow::ReadBoolFromEnvVar("TF_DISABLE_TENSOR_OP_MATH",
@zheng-xq (Contributor) commented:

In general, I'm not a fan of hierarchical flags, since their relationship is not clearly documented. If the name is confusing, why not just change the other one to "TF_DISABLE_CUDNN_TENSOR_OP_MATH"?

@nluehr (Contributor, Author) replied:

Removed TF_DISABLE_TENSOR_OP_MATH.

@@ -2049,6 +2152,15 @@ bool CUDABlas::DoBlasGemmWithAlgorithmImpl(
return false;
}

#if CUDA_VERSION >= 9000
cublasGemmAlgo_t cublas_algo = static_cast<cublasGemmAlgo_t>(algorithm);
if (cublas_algo >= CUBLAS_GEMM_DFALT_TENSOR_OP &&
@zheng-xq (Contributor) commented:

Why not use CUBLAS_GEMM_DEFAULT_TENSOR_OP?

Also, I'm not a fan of comparing enums. Could you split the "cublas_algo >= CUBLAS_GEMM_DFALT_TENSOR_OP" check into its own function? In fact, I'd prefer to list the tensor-op algorithms one by one, so future changes have to update them.

@nluehr (Contributor, Author) replied:

Changed to CUBLAS_GEMM_DEFAULT_TENSOR_OP and moved inequality check to a separate function.
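
Roughly, that refactor looks like the following sketch (the helper name and the exact guard condition are illustrative):

#if CUDA_VERSION >= 9000
// True if the given cuBLAS GEMM algorithm runs on tensor cores.
static bool UsesTensorOps(cublasGemmAlgo_t algo) {
  return algo >= CUBLAS_GEMM_DEFAULT_TENSOR_OP;
}
#endif

DoBlasGemmWithAlgorithmImpl() can then reject a tensor-op algorithm whenever tensor ops are disabled (the actual patch also checks that the inputs are fp16):

#if CUDA_VERSION >= 9000
  cublasGemmAlgo_t cublas_algo = static_cast<cublasGemmAlgo_t>(algorithm);
  if (UsesTensorOps(cublas_algo) && !TensorOpMathEnabled()) {
    return false;
  }
#endif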

for (cublasGemmAlgo_t algo : {
     CUBLAS_GEMM_DFALT, CUBLAS_GEMM_ALGO0, CUBLAS_GEMM_ALGO1,
     CUBLAS_GEMM_ALGO2, CUBLAS_GEMM_ALGO3, CUBLAS_GEMM_ALGO4,
     CUBLAS_GEMM_ALGO5, CUBLAS_GEMM_ALGO6, CUBLAS_GEMM_ALGO7}) {
@zheng-xq (Contributor) commented:

Could you split the list into its own function? That way we know to update it, or could even auto-generate it from the header file.

@nluehr (Contributor, Author) replied:

I'm not sure what is being suggested here. The current function seems pretty clean to me. Do you envision separate functions for CUBLAS 8 vs. 9 algorithms, or a separate function for tensor-op vs. non-tensor-op algorithms? Either of these would make the API higher up the stack more complicated.

@nluehr force-pushed the gemm-tensor-core branch 2 times, most recently from fc65556 to 776842c, on December 13, 2017
- Applies to matrix multiplications with fp16 input/output.
  Computations will fall back to pseudo-fp16 if tensor op math is
  disabled or not supported.
- Enabled by default. Tensor ops (both in cublas gemms and cudnn
  convolutions) can be disabled globally by setting the
  environment variable TF_DISABLE_TENSOR_OP_MATH=1. To disable
  tensor ops specifically for gemms or convolutions use
  TF_DISABLE_CUBLAS_TENSOR_OP_MATH=1 or
  TF_DISABLE_CUDNN_TENSOR_OP_MATH=1, respectively.
- Added CUBLAS 9.0 algorithms to GetBlasGemmAlgorithms().
@zheng-xq (Contributor) commented:

This looks okay for now. Let's merge it in and improve it later. Thanks for the contribution!

@zheng-xq (Contributor) commented:

LGTM

@martinwicke added the awaiting testing (then merge) label and removed the awaiting review and stat:awaiting tensorflower labels on Dec 23, 2017
@drpngx (Contributor) commented on Dec 25, 2017:

Jenkins, test this please.

@drpngx added the kokoro:force-run label on Dec 25, 2017
@kokoro-team removed the kokoro:force-run label on Dec 25, 2017
@drpngx merged commit bffa3e1 into tensorflow:master on Dec 26, 2017
angerson pushed a commit to angerson/tensorflow that referenced this pull request Jan 11, 2018
av8ramit pushed a commit that referenced this pull request Jan 12, 2018
@nluehr deleted the gemm-tensor-core branch on January 13, 2021