Inconsistent results when trying to enable Tensor Cores on NVIDIA T4 #42311
Comments
Can you please try to convert your input tensors and model into the channels-last memory format?
If inputs are FP16 and CUDNN_TENSOR_OP_MATH is set on the convolution descriptor, Tensor Core use is allowed but not guaranteed. cuDNN may still choose a non-Tensor-Core algorithm if its heuristics expect it to be faster for your convolution sizes. The fact that CUDNN_TENSOR_OP_MATH sometimes appears not to be set for FP16 inputs is unexpected. AFAIK PyTorch's convolution backend should always set CUDNN_TENSOR_OP_MATH for FP16 inputs, at least for cuDNN 7605. Do you have a minimal example of a convolution with FP16 inputs that doesn't get CUDNN_TENSOR_OP_MATH set for its descriptor? If so, we should track down why. To sandbox potential issues with JIT/autocast, try running the ordinary Python model with …
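A minimal FP16-convolution probe along those lines might look like the following sketch. The channel sizes and file name are illustrative assumptions, not from the thread; the env vars in the comment enable cuDNN's API logging so the `cudnnSetConvolutionMathType` calls become visible.

```python
# Run with cuDNN API logging enabled to see the math-type calls, e.g.:
#   CUDNN_LOGINFO_DBG=1 CUDNN_LOGDEST_DBG=stdout python probe.py
import torch

def make_fp16_conv(in_ch=8, out_ch=16):
    # Channel counts are multiples of 8, which Tensor Core kernels prefer
    return torch.nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1).half()

if torch.cuda.is_available():
    conv = make_fp16_conv().cuda()
    x = torch.randn(16, 8, 64, 64, device="cuda", dtype=torch.float16)
    y = conv(x)  # the log should show TENSOR_OP_MATH set on this conv's descriptor
```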
Looking at some of the convolutions I posted above, do you think that might be the case? I can't really see anything that indicates that Tensor Cores wouldn't help.
I did an experiment from the Python code base as you said. I set …

Also, I'm still getting the weird behaviour where PyTorch sets …

I think we can rule out interference from AMP, JIT, or C++. As I understand it, there are basically two issues:

I'll also try Vitaly's suggestion and report back!
@VitalyFedyunin I think that there's no support for NDHWC in …
Shouldn't it be automatically transposed to whichever memory format (channels last) allows half precision? I thought cuDNN was doing this already, or otherwise TorchScript.
How do we convert to the memory format?
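For the question above: assuming the standard `memory_format` API, the conversion can be sketched as follows. `torch.channels_last_3d` is the 3D analogue, which would be the relevant one for a 3D-convolution model like I3D.

```python
import torch

# 2D case: NCHW tensor relaid as NHWC (strides change, logical shape does not)
x2d = torch.randn(2, 3, 32, 32).to(memory_format=torch.channels_last)
conv2d = torch.nn.Conv2d(3, 8, 3).to(memory_format=torch.channels_last)

# 3D case: NCDHW tensor relaid as NDHWC
x3d = torch.randn(2, 3, 8, 32, 32).to(memory_format=torch.channels_last_3d)

print(x2d.is_contiguous(memory_format=torch.channels_last))     # True
print(x3d.is_contiguous(memory_format=torch.channels_last_3d))  # True
```

Both the inputs and the module need the conversion; converting only one of them typically forces a layout transpose at the convolution boundary.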
Does this mean that there's no Tensor Core support for models with 3D convolutions (because that would require them to be in NDHWC format)?
🐛 Bug
I am trying to figure out if my model is correctly using Tensor Cores on NVIDIA T4 but it seems that PyTorch is not enabling them correctly.
Context: I'm trying to get Tensor Cores to work with I3D. The model is converted to TorchScript and I'm executing it through the C++ API. I've converted the model to FP16 with `.half()`, and I'm seeing the following cuDNN logs:

As can be seen from the logs, PyTorch calls `cudnnSetConvolutionMathType` three times: first it sets the math type to DEFAULT, then to TENSOR_OP_MATH, then back to DEFAULT. This only happens about 1 out of 4 times; the other times it works correctly and PyTorch calls `cudnnSetConvolutionMathType` with DEFAULT and then TENSOR_OP_MATH twice.

Even when PyTorch sets the math type correctly, I'm not seeing Tensor Core usage (I think); the output from `nvprof` shows:

I would expect to see "h884" in `volta_fp16_scudnn_fp16_128x128_stridedB_splitK_small_nn_v1` (i.e. `volta_fp16_h884cudnn_fp16_128x128_stridedB_splitK_small_nn_v1`) if Tensor Cores were used. When using the performance profiling tools, it does show some Tensor Core usage (less than 1%), but I think that's incorrect.

How can I correctly trigger Tensor Core usage in this case?
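As general background to that question (a hedged sketch of commonly cited preconditions, not a confirmed answer from this thread): cuDNN only picks Tensor Core kernels when it expects them to win, so letting it benchmark algorithms and keeping dimensions Tensor-Core-friendly are the usual first knobs to try.

```python
import torch

# Let cuDNN time candidate algorithms and cache the fastest one per shape
# (helps when input shapes are fixed, as in a video model like I3D)
torch.backends.cudnn.benchmark = True

def tensor_core_friendly(channels: int) -> int:
    """Round a channel count up to the next multiple of 8, which FP16
    Tensor Core kernels generally prefer (an assumption about cuDNN
    heuristics, not a guarantee)."""
    return ((channels + 7) // 8) * 8
```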
To Reproduce
Steps to reproduce the behavior:
3. `model.half()`
4. `torch.jit.script`
I'm pretty sure the same would happen without steps 3 and 4, but I haven't gotten around to testing that yet.
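To make the repro pipeline concrete, a minimal sketch of the conversion path described above (the toy stand-in model and file name are illustrative assumptions, not the actual I3D code):

```python
import torch

# Toy stand-in for I3D: a single 3D convolution (illustrative only)
model = torch.nn.Sequential(
    torch.nn.Conv3d(3, 8, kernel_size=3, padding=1)
)
model = model.half().eval()         # convert weights to FP16
scripted = torch.jit.script(model)  # compile to TorchScript
scripted.save("model_fp16.pt")      # later loaded from C++ via torch::jit::load
```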
Expected behavior
I would expect Tensor Core usage in this case.
Environment

- How you installed PyTorch (`conda`, `pip`, source): pip, `libtorch` via website

Additional context
I'm still trying to see if I can get Tensor Core usage by converting the model through `torch.cuda.amp`, but I'm running into this bug: #36428

Will report back if I have results from that. I'm not sure if it matters, though: as far as I understand, if Tensor Cores cannot be activated even when the whole model is converted to FP16, then automatic mixed precision won't help either, but I might be wrong about that.
EDIT: I got automatic mixed precision to work and I'm seeing the same results in terms of cuDNN logs. Sometimes PyTorch correctly sets `CUDNN_TENSOR_OP_MATH`, sometimes it doesn't. Example:

✅ Correct:
🚫 Incorrect:
EDIT 2: Output from `nvprof` still suggests no usage of Tensor Core kernels when using automatic mixed precision, just like before 😞

Thanks in advance!
cc @mcarilli @csarofeen @ptrblck @VitalyFedyunin @jamesr66a