
Fail to find the dnn implementation while using recurrent layers #45248

Closed
Avditvs opened this issue Nov 28, 2020 · 9 comments

Comments


Avditvs commented Nov 28, 2020

System information

  • OS Platform and Distribution: Ubuntu 18.04 running in WSL2
  • TensorFlow version: 2.3.1
  • Python version: 3.6.9
  • CUDA/cuDNN version: CUDA 10.1 / cuDNN 7.6.5.32
  • GPU model and memory: RTX 2060 6GB

Current behavior
I want to train a model containing Keras LSTM layers, but the following error occurs:

Jupyter output:
UnknownError: Fail to find the dnn implementation. [[{{node CudnnRNN}}]] [[sequential_2/lstm_1/PartitionedCall]] [Op:__inference_train_function_5270]

Console output:
OP_REQUIRES failed at cudnn_rnn_ops.cc:1510 : Unknown: Fail to find the dnn implementation.

Expected behavior
I expect the code to run, since I am able to run Conv2D layers which are properly accelerated by the GPU.
I have already tried multiple things, such as using different TensorFlow/CUDA/cuDNN versions.
I also tried enabling memory growth as described in #36508, but it did not work either (a minimal sketch of that workaround is shown below).
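
For reference, a minimal sketch of the memory-growth workaround from #36508 (assuming a TF 2.x API; this is not the exact code from that issue):

import tensorflow as tf

# Ask TensorFlow to allocate GPU memory incrementally instead of
# reserving it all up front; this must run before any op touches the GPU.
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)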

Standalone code to reproduce the issue
The environment was set up by following the installation instructions (without installing the NVIDIA driver inside the VM, as mentioned in the NVIDIA documentation): https://www.tensorflow.org/install/gpu#install_cuda_with_apt

I was able to reproduce this issue by running the RNN tutorial from the TensorFlow documentation: https://www.tensorflow.org/guide/keras/rnn (a minimal stand-in that triggers the same code path is sketched below).
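
The tutorial itself is linked above; as a minimal stand-in (with hypothetical shapes and layer sizes), something like the following exercises the same cuDNN-backed LSTM kernel that fails:

import numpy as np
import tensorflow as tf

# A tiny LSTM model; with default layer arguments Keras dispatches to
# the fused CudnnRNN kernel on GPU, which is where the error surfaces.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(10, 8)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')

# Random dummy data just to trigger a training step.
x = np.random.rand(32, 10, 8).astype('float32')
y = np.random.rand(32, 1).astype('float32')
model.fit(x, y, epochs=1)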

I would appreciate any help to solve this issue.


Avditvs commented Nov 28, 2020

I finally solved the problem by reverting the NVIDIA driver installed on Windows from 465 to 460, as mentioned in the note in the NVIDIA documentation: https://docs.nvidia.com/cuda/wsl-user-guide/index.html

Since using Windows Subsystem for Linux is becoming more and more common, why not add a section to the documentation on setting up TensorFlow with CUDA inside WSL?

@ravikyram ravikyram added comp:gpu GPU related issues TF 2.3 Issues related to TF 2.3 labels Nov 29, 2020
@ravikyram ravikyram added type:build/install Build and install issues and removed type:bug Bug labels Nov 29, 2020
@mihaimaruseac
Collaborator

We are currently discussing moving towards Windows GPU support only via WSL.

@jvishnuvardhan jvishnuvardhan added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Dec 8, 2020

ICG14 commented Jan 26, 2021

I still have not been able to solve this issue.
I have tried everything mentioned above, but the problem remains the same.
My setup is:

Ubuntu 18.04
CUDA 10.0
TensorFlow 2.0
NVIDIA driver 460 (I have also tried 450, which does not work either)
GeForce RTX 2060
Python 3.7

I have also tried building with CUDA 10.1 and TF 2.1, but that did not solve it either. It is starting to get a little frustrating.

This is what I obtain after fitting:

Epoch 1/50
2021-01-25 18:59:34.964218: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2021-01-25 18:59:35.096029: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
128/2156 [>.............................] - ETA: 15sWARNING:tensorflow:Can save best model only with val_loss available, skipping.

.2021-01-25 18:59:35.364099: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2021-01-25 18:59:35.364136: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at cudnn_rnn_ops.cc:1510 : Unknown: Fail to find the dnn implementation.
2021-01-25 18:59:35.364158: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: Fail to find the dnn implementation.
[[{{node CudnnRNN}}]]
2021-01-25 18:59:35.364356: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: {{function_node __forward_cudnn_lstm_with_fallback_2517_specialized_for_sequential_lstm_StatefulPartitionedCall_at___inference_distributed_function_3196}} {{function_node __forward_cudnn_lstm_with_fallback_2517_specialized_for_sequential_lstm_StatefulPartitionedCall_at___inference_distributed_function_3196}} Fail to find the dnn implementation.
[[{{node CudnnRNN}}]]
[[sequential/lstm/StatefulPartitionedCall]]

All of the cuDNN and CUDA tests pass.
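
Not part of the original report, but a quick way to sanity-check what TensorFlow itself sees (TF 2.x API), independent of the system-level CUDA/cuDNN samples:

import tensorflow as tf

# Confirm the installed wheel was built with CUDA and that the GPU is
# visible to TensorFlow, not just to the system tools.
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("Visible GPUs:", tf.config.experimental.list_physical_devices('GPU'))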


pupscub commented May 16, 2021

> (quoting @ICG14's comment above)

Did you find any solution?
I have the same system configuration and I am facing the same issue while running my DL model.

@sanatmpa1

@iamMOY,

Can you take a look at this link for the tested build configurations, update to the latest stable version (TF 2.6.0), and create a new issue if you still face the problem? Thanks!
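
After upgrading (e.g. pip install tensorflow==2.6.0), a short check that the expected version is installed and the GPU is still visible (assuming a TF 2.x environment):

import tensorflow as tf

# Print the installed TensorFlow version and the GPUs it can see.
print(tf.__version__)
print(tf.config.list_physical_devices('GPU'))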

@sanatmpa1 sanatmpa1 self-assigned this Sep 1, 2021
@sanatmpa1

@Avditvs,

Since the problem was fixed after you downgraded the NVIDIA driver, can you confirm that we are good to close this issue? Thanks!

@sanatmpa1 sanatmpa1 added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Sep 1, 2021
@google-ml-butler

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

@google-ml-butler google-ml-butler bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Sep 8, 2021
@google-ml-butler

Closing as stale. Please reopen if you'd like to work on this further.

