Performance of nn.conv1d and keras.layers.Conv1D is low the first time any given input size is processed even if retracing is prevented! #54456
Comments
I believe I have made some progress using the TensorBoard profiler. The first time the convolution is run on a given size, the profiler shows all of the following kernels being run:
The second time any given size is processed, only a single kernel is used. This is problematic notably for inference with audio, where the size of the audio files can vary between 1 second (16000 samples) and 40 seconds (640000 samples): as mentioned above, the software will spend most of its time trying out kernels and will therefore need more than a day before it starts to perform optimally. The question here is whether this trial stage can be prevented by manually preselecting a strategy.
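For reference, a minimal sketch of how such a trace can be captured with the TensorBoard profiler; the log directory, shapes, and filter sizes below are illustrative placeholders, not values from the original report:

```python
import tensorflow as tf

# Placeholder shapes: [batch, samples, channels] audio and a
# [width, in_channels, out_channels] filter bank.
filters = tf.random.normal([32, 1, 64])
x_1s = tf.random.normal([1, 16000, 1])   # 1 s at 16 kHz
x_2s = tf.random.normal([1, 32000, 1])   # a new size -> autotuning runs again

tf.profiler.experimental.start("logdir")
tf.nn.conv1d(x_1s, filters, stride=1, padding="SAME")  # first call: many kernels profiled
tf.nn.conv1d(x_1s, filters, stride=1, padding="SAME")  # same size: single kernel
tf.nn.conv1d(x_2s, filters, stride=1, padding="SAME")  # unseen size: many kernels again
tf.profiler.experimental.stop()
```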
Hi @chunduriv! Could you please look at this issue? It's replicating in 2.8 and throwing a different error in 2.7 and nightly. Thanks!
@roebel! Did you try the same with distributed training yet?
@mohantym Thanks for your reply. But I wonder why you suggest distributed training? I don't have a problem with training; I have a problem with inference. Also, for my use cases I don't have access to multiple GPUs, so I cannot use distributed inference - or could I? If these tests were run in parallel on the same GPU, that might help, but it would also require more memory.
Hi, I have a similar problem (running on a GTX 1080): using conv1d layers makes my kernel die every time, as my GPU memory is fully saturated even with very low-dimensional input shapes. I tried TensorFlow versions 2.5, 2.6, 2.7 and 2.9. I am running on CUDA v11.6 and cuDNN v8.4.1 (Windows). I could not find anything that avoids the crash. If anyone has an idea, you're very much welcome!
I am sorry, but the report of @LerysG is completely unrelated to what I describe here, so there is nothing I could reasonably check. Similarly, issue #56387 is not related at all. The problem I describe is an implementation problem in the TensorFlow code.
I think I have found the source code that is responsible for the issue: CUDA supports a number of convolution kernels, which in the current version of TensorFlow (2.9.0) are obtained by means of CudnnSupport::GetConvolveRunners here, which is then used here in the autotune functions.
It appears that each time a new configuration, consisting of data shape, filter shape, and maybe other parameters, is encountered, the CUDA driver tests all of the kernels and retains the most efficient one. This is a very nice optimization for most cases, notably training with constant batch shapes or inference with constant image sizes. For inference with audio signals, however, the CUDA implementation spends most of its time testing all kernel versions, and it hardly ever benefits from knowing which kernel is the most efficient one for a given configuration, because the same configuration is hardly ever encountered a second time. All this reminds me of fftw3, which also uses this kind of autotuning depending on the FFT size; but there the FFT sizes do not change so much, and moreover it is possible to store the results in the form of wisdom. I think for the present case it would be nice to be able to affect the kernel selection explicitly. I wonder whether one could define a maximum size above which the selected kernel would no longer be adapted: e.g., assuming one could select a maximum adaptation length via an environment variable TENSORFLOW_CONV_MAXADAPLEN, then if the variable is set, autotuning would always be run with min(shape[i], TENSORFLOW_CONV_MAXADAPLEN). In that case the present problem could be solved when needed, without any negative impact for the cases with networks working on constant shapes (image processing). A user-side workaround along the same lines is sketched below.
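Until something like the proposed TENSORFLOW_CONV_MAXADAPLEN exists, a user-side sketch of the same idea is to pad every signal to one of a small set of bucket lengths, so that only a few distinct shapes ever reach cuDNN and each autotuning result gets reused. The helper below is hypothetical, not part of any TensorFlow API:

```python
import tensorflow as tf

def pad_to_bucket(x):
    """Zero-pad a [batch, length, channels] tensor to the next power-of-two
    length, so cuDNN autotuning only ever sees a handful of distinct shapes."""
    length = tf.shape(x)[1]
    log2_len = tf.math.log(tf.cast(length, tf.float32)) / tf.math.log(2.0)
    bucket = tf.cast(2.0 ** tf.math.ceil(log2_len), tf.int32)
    return tf.pad(x, [[0, 0], [0, bucket - length], [0, 0]])

# Example: a 100000-sample signal is padded to 131072 samples, the same
# bucket as every other signal between 65537 and 131072 samples.
x = tf.random.normal([1, 100000, 1])
print(pad_to_bucket(x).shape)  # (1, 131072, 1)
```

Alternatively, setting the TF_CUDNN_USE_AUTOTUNE=0 environment variable should disable cuDNN autotuning entirely, at the price of always running a default kernel rather than the fastest one for each shape.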
Please make sure that this is an issue related to performance of TensorFlow. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.
System information
Have I written custom code: yes
OS platform and distribution: Linux Ubuntu 20.04
TensorFlow installed from: pip
TensorFlow version: tested with TF 2.4, 2.6, 2.8
Python version: 3.7
CUDA version: CUDA Toolkit 11.3.1
cuDNN version: cuDNN 8.3.1
GPU models: tested with GeForce GTX 1050 Ti and GeForce GTX 1080 Ti
Describe the current behavior
In both cases, using eager mode and using a tf.function with experimental_relax_shapes=True, running tf.nn.conv1d is slow the first time a tensor of any given size is processed; processing a new tensor of the same size is then 4 times (on a GTX 1050 Ti) or 10 times (on a GTX 1080 Ti) faster on the second and subsequent runs.
The observed behavior is a severe problem for running inference on audio signals, because audio signals generally have very different sizes. In a production environment the code will therefore run at only 10% of maximum performance (on a GTX 1080 Ti) for the first few 100k examples, until the model has seen sufficiently many lengths to reach peak performance.
Describe the expected behavior
conv1d processing time should depend on the size of the input vector, not on the number of times the same size has been seen.
Standalone code to reproduce the issue
A colab notebook is here.
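For readers without access to the notebook, a minimal sketch of the kind of timing experiment it performs; the shapes and filter sizes are illustrative placeholders, not the notebook's actual values:

```python
import time
import tensorflow as tf

filters = tf.random.normal([32, 1, 64])

@tf.function(experimental_relax_shapes=True)
def conv(x):
    return tf.nn.conv1d(x, filters, stride=1, padding="SAME")

# Time each length twice: the first call per length triggers autotuning,
# the second call with the same length reuses the selected kernel.
for length in (16000, 16000, 640000, 640000):
    x = tf.random.normal([1, length, 1])
    t0 = time.time()
    conv(x).numpy()  # .numpy() forces the GPU work to complete
    print(f"length={length}: {time.time() - t0:.4f} s")
```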
Other info / logs
Include any logs or source code that would be helpful to diagnose the problem.
Result of running on colab with GPU:
Notes: