
Add GPU support for float16 batched matmul #18436

Merged

merged 2 commits into tensorflow:master from benbarsdell:add-gpu-float16-batched-matmul on May 10, 2018

Conversation

@benbarsdell
Contributor

benbarsdell commented Apr 11, 2018

  • Uses cublasGemmBatchedEx introduced in CUDA 9.1.
  • Includes support for Tensor Op math.
  • Falls back to a loop over non-batched gemm calls on older CUDA
    versions or GPU architectures.

Note that //tensorflow/python/kernel_tests:batch_matmul_op_test previously passed only because it does not specify force_gpu=True and therefore falls back to the CPU.

Notifying @tfboyd
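
As a rough picture of the dispatch described above, here is a minimal standalone sketch. It is not the PR's actual stream_executor code: the helper name BatchedHalfGemm and its exact signature are invented for illustration, and a, b, and c are assumed to be host arrays of device pointers.

#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Sketch only: dispatch between cublasGemmBatchedEx (CUDA >= 9.1) and a
// loop of non-batched GEMMs, mirroring the strategy the PR describes.
cublasStatus_t BatchedHalfGemm(cublasHandle_t handle, int m, int n, int k,
                               float alpha, const __half* const a[], int lda,
                               const __half* const b[], int ldb, float beta,
                               __half* const c[], int ldc, int batch_count) {
#if CUDA_VERSION >= 9010
  // cublasGemmBatchedEx expects the pointer arrays in device memory.
  const void** da = nullptr;
  const void** db = nullptr;
  void** dc = nullptr;
  size_t bytes = batch_count * sizeof(void*);
  cudaMalloc(reinterpret_cast<void**>(&da), bytes);
  cudaMalloc(reinterpret_cast<void**>(&db), bytes);
  cudaMalloc(reinterpret_cast<void**>(&dc), bytes);
  cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(dc, c, bytes, cudaMemcpyHostToDevice);
  // CUDA_R_32F keeps fp32 accumulation for fp16 operands; the TENSOR_OP
  // algorithm flag is what permits Tensor Core math on Volta and newer.
  cublasStatus_t status = cublasGemmBatchedEx(
      handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha,
      da, CUDA_R_16F, lda, db, CUDA_R_16F, ldb, &beta,
      dc, CUDA_R_16F, ldc, batch_count, CUDA_R_32F,
      CUBLAS_GEMM_DEFAULT_TENSOR_OP);
  cudaFree(da);
  cudaFree(db);
  cudaFree(dc);
  return status;
#else
  // Fallback: one non-batched GEMM per batch element. cublasSgemmEx
  // (fp16 operands, fp32 compute) predates CUDA 9.1.
  for (int i = 0; i < batch_count; ++i) {
    cublasStatus_t status = cublasSgemmEx(
        handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha,
        a[i], CUDA_R_16F, lda, b[i], CUDA_R_16F, ldb, &beta,
        c[i], CUDA_R_16F, ldc);
    if (status != CUBLAS_STATUS_SUCCESS) return status;
  }
  return CUBLAS_STATUS_SUCCESS;
#endif
}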

@googlebot googlebot added the cla: yes label Apr 11, 2018
@tfboyd tfboyd requested a review from zheng-xq Apr 11, 2018
@zheng-xq zheng-xq requested review from yzhwang and removed request for zheng-xq Apr 11, 2018
@zheng-xq

Contributor

zheng-xq commented Apr 11, 2018

Also add @protoget

@jhseu jhseu self-assigned this Apr 12, 2018

Contributor

yzhwang left a comment

Thanks a lot for the change, Ben! I think this will also be useful for enabling autotune for batch_matmul.

Please make sure that with CUDA >= 9.1, the float16 batched_matmul on the GPU really uses cublasGemmBatchedEx(). (Ideally we would want to know the CUDA/cuDNN version at the Python level, so that we could write version-dependent code, but I don't know how to do that or whether TensorFlow supports it.) Other than that, only one other comment (see below).
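
(On the version question: one way to confirm at runtime which CUDA runtime and cuBLAS a binary actually uses is to query the libraries directly from C++. A standalone illustration, not part of the PR or of TensorFlow's API:)

#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
  int runtime_version = 0;
  cudaRuntimeGetVersion(&runtime_version);  // e.g. 9010 means CUDA 9.1
  cublasHandle_t handle;
  cublasCreate(&handle);
  int cublas_version = 0;
  cublasGetVersion(handle, &cublas_version);
  std::printf("CUDA runtime %d, cuBLAS %d\n", runtime_version, cublas_version);
  cublasDestroy(handle);
  return 0;
}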

cudaDataType_t data_type = CUDADataType<T>::type;
cudaDataType_t compute_type = CUDA_R_32F;

#if CUDA_VERSION >= 9010

@yzhwang
Contributor

yzhwang commented Apr 16, 2018

Is there any reason we cannot put this part into the already existing function CUDABlas::DoBlasGemmBatchedInternal()? That way, cublasGemmBatchedEx() could be used for all data types (half, float, and double).
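
A rough sketch of this suggestion, using a type-to-enum trait in the spirit of the CUDADataType<T> visible in the excerpt above. The trait and wrapper here are illustrative, not the actual stream_executor code; the pointer arrays are assumed to live in device memory, and alpha/beta must match compute_type.

#include <cublas_v2.h>
#include <cuda_fp16.h>

// Illustrative trait mapping an element type to its cudaDataType_t tag.
template <typename T> struct GpuDataType;
template <> struct GpuDataType<__half> { static constexpr cudaDataType_t kType = CUDA_R_16F; };
template <> struct GpuDataType<float>  { static constexpr cudaDataType_t kType = CUDA_R_32F; };
template <> struct GpuDataType<double> { static constexpr cudaDataType_t kType = CUDA_R_64F; };

#if CUDA_VERSION >= 9010
// One templated wrapper instead of per-type cublas{H,S,D}gemmBatched
// calls: the element type becomes a runtime cudaDataType_t argument.
template <typename T>
cublasStatus_t GemmBatchedEx(cublasHandle_t handle, int m, int n, int k,
                             const void* alpha, const void* const a[], int lda,
                             const void* const b[], int ldb, const void* beta,
                             void* const c[], int ldc, int batch_count,
                             cudaDataType_t compute_type) {
  return cublasGemmBatchedEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, alpha,
                             a, GpuDataType<T>::kType, lda,
                             b, GpuDataType<T>::kType, ldb, beta,
                             c, GpuDataType<T>::kType, ldc, batch_count,
                             compute_type, CUBLAS_GEMM_DFALT);
}
#endif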

@ekelsen

Contributor

ekelsen commented Apr 26, 2018

@benbarsdell Thanks for this change - could you please respond to @yzhwang's comments so that we can get to merging?

@benbarsdell

Contributor Author

benbarsdell commented May 1, 2018

Sorry for the delay. I'm working on addressing this now.

@benbarsdell

Contributor Author

benbarsdell commented May 9, 2018

@yzhwang does this look OK now? We could add autotuning in a later PR.

Contributor

yzhwang left a comment

Thanks a lot for the changes, Ben!

@jhseu jhseu merged commit f08f24c into tensorflow:master May 10, 2018
16 checks passed

Android Demo App Internal CI build successful
GPU CC Internal CI build successful
GPU Python3 Internal CI build successful
MacOS Contrib Internal CI build successful
MacOS Python2 and CC Internal CI build successful
Ubuntu CC Internal CI build successful
Ubuntu Makefile Internal CI build successful
Ubuntu Python2 Internal CI build successful
Ubuntu Python3 Internal CI build successful
Ubuntu Python3 PIP Internal CI build successful
Ubuntu Sanity Internal CI build successful
Ubuntu contrib Internal CI build successful
Windows Bazel Internal CI build successful
Windows CMake Internal CI build successful
XLA Internal CI build successful
cla/google All necessary CLAs are signed
@benbarsdell benbarsdell deleted the benbarsdell:add-gpu-float16-batched-matmul branch May 10, 2018
@achalshah20

Contributor

achalshah20 commented May 10, 2018

I cannot compile TensorFlow with CUDA 9.1 because of this. Shouldn't the PERFTOOLS_GPUTOOLS_CUBLAS_WRAP macro be STREAM_EXECUTOR_CUBLAS_WRAP?

PR #19210 will fix this.
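
For context, the macro in question belongs to stream_executor's dlopen-based cuBLAS wrappers, which were being renamed from the PERFTOOLS_GPUTOOLS_ prefix to STREAM_EXECUTOR_. A heavily simplified illustration of the pattern (macro body and names invented here, not the real TensorFlow code):

#include <dlfcn.h>
#include <cublas_v2.h>

// Each wrapped symbol becomes a trampoline that resolves the real
// function at runtime via dlsym.
#define MY_CUBLAS_WRAP(__name)                                           \
  struct Wrap_##__name {                                                 \
    template <typename... Args>                                          \
    cublasStatus_t operator()(Args... args) {                            \
      static void* f = dlsym(RTLD_DEFAULT, #__name);                     \
      if (f == nullptr) return CUBLAS_STATUS_NOT_INITIALIZED;            \
      return reinterpret_cast<cublasStatus_t (*)(Args...)>(f)(args...);  \
    }                                                                    \
  } __name;

namespace wrap {
MY_CUBLAS_WRAP(cublasGemmBatchedEx)  // call sites use wrap::cublasGemmBatchedEx
}  // namespace wrap

If a call site still expands the old macro name after the definition has been renamed, the wrapper is never generated and the build fails, which matches the error reported here.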

@benbarsdell

Contributor Author

benbarsdell commented May 10, 2018

Thanks @achalshah20! This got overlooked in the automerge.

freedomtan added a commit to freedomtan/tensorflow that referenced this pull request Jul 30, 2018
relevant code in tensorflow#18436, should be safe to enable it