[tensorrt] Failed Execution #3835
I don't think K80s can run fp16. Does the fp32 loop work? If you have access to a P100 or V100, does running the command line above work?
Nada. Running:
Got:
Oh boy this is going to be fun to debug: munmap is failing:
Is there anything I can do to help solve this?
It's possible this is a CUDA/CuDNN compatibility issue:
Yes, I can run the whole training and execution on the device. I already tried using cuDNN 7005 with no success; bear in mind that all these logs (except the first one) were generated with CUDA 9 and cuDNN 7.0.0. So yes, I have been trying to use the standard environment.
I'm a little confused in that case; based on the error message, it looks like maybe TF was built with 7.0, but you are currently running with 7.1. The initial env details above also indicate cuDNN 7.1. Can you provide the output of the following commands?
(Note that the last of these may actually be at a different path if your CUDA path is non-standard.)
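The exact commands from this exchange weren't preserved in the thread; the usual checks are `nvcc --version` for the CUDA toolkit and the `CUDNN_*` defines in `cudnn.h` (commonly under `/usr/include` or `/usr/local/cuda/include`, though the path varies). As a sketch, the version can be read out of those defines like this — the header snippet here is a stand-in, not output from the reporter's machine:

```python
import re

# Stand-in for the first lines of a real cudnn.h; on an actual system you
# would read the file from disk instead.
SAMPLE_CUDNN_H = """
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 5
"""

def cudnn_version(header_text):
    """Extract (major, minor, patch) from cudnn.h-style #define lines."""
    fields = dict(re.findall(r"#define CUDNN_(\w+) (\d+)", header_text))
    return tuple(int(fields[k]) for k in ("MAJOR", "MINOR", "PATCHLEVEL"))

print("cuDNN %d.%d.%d" % cudnn_version(SAMPLE_CUDNN_H))  # cuDNN 7.0.5
```

The same three defines are what shell one-liners like `grep CUDNN_MAJOR -A2 /usr/include/cudnn.h` inspect.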
Oh, I might have to clarify: the first error was on a standard AWS instance, which comes preinstalled with CUDA and cuDNN. Then I created another instance on which I installed CUDA 9.0 with cuDNN 7.0 specifically, so that I have the most standard configuration to reproduce the bug.
I see. To debug, I would gather:
nvidia-smi: reports the Python process until it hits the stack trace, at around 20% usage.
I have the same error, and I do a
It sounds like there is something about your local env/C/py/something that is perhaps causing problems.
When I rebooted my PC, I had the same error as you:
@rere-corporation The error you are getting is because we had an issue where we were enforcing that the minor cuDNN version TensorFlow is compiled with match the minor version installed locally. In this case 7.0.5 (what it was compiled with) does not match 7.1.2 (what is installed). After working with NVIDIA, they updated the documentation to indicate that this should work fine, and we updated TensorFlow. I believe the change is in the nightly builds and will be in TF 1.8 forward. I did my testing on AWS p3 instances using Docker. If you want to try my pip package (which is custom built), it was built with CUDA 9.0 and cuDNN 7.0.5 and will complain if you have 7.1 installed. This is what I used for testing on March 18th; things have changed since, but it might be of interest, so I am sharing.
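The version rule described in that comment can be sketched as follows. This is an interpretation, not TensorFlow's actual source: I'm assuming the relaxed rule means "same major version, and the installed runtime is at least the compiled-against version", and I use cuDNN's integer encoding (7.0.5 → 7005):

```python
# Sketch of the cuDNN compatibility check described above (hypothetical
# helper names; versions encoded as cuDNN reports them: 7.0.5 -> 7005).

def split(v):
    """Decompose e.g. 7005 into (major, minor, patch): (7, 0, 5)."""
    return v // 1000, (v % 1000) // 100, v % 100

def strict_ok(compiled, installed):
    """Old rule: major AND minor must match exactly."""
    return split(compiled)[:2] == split(installed)[:2]

def relaxed_ok(compiled, installed):
    """Assumed relaxed rule: same major, installed >= compiled."""
    return split(compiled)[0] == split(installed)[0] and installed >= compiled

# 7.0.5 compiled vs 7.1.2 installed: rejected before the fix, accepted after.
print(strict_ok(7005, 7102), relaxed_ok(7005, 7102))  # False True
```

This is why the same binary that aborted on a cuDNN 7.1.2 machine started working once the check was loosened in the nightlies.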
@Davidnet I see libnvinfer.so.4 in your log, but TensorFlow 1.7 requires TensorRT 3.0.4; see here: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/tensorrt
Had this same issue, and pip install tf-nightly-gpu is what solved it for me. Was running TF 1.7, nvcc 9.0, and cuDNN 7.
Switching over to tf-nightly-gpu worked for me too.
I have the same problem with CUDA 9.1 and cuDNN 7.1; tf-nightly-gpu can't help me.
I solved my problem by reinstalling cuDNN 7.0 and TensorFlow 1.7.
Had this same issue, and pip install tf-nightly-gpu is what solved it for me. Was running TF 1.7, nvcc 9.0, and cuDNN 7.1.2 on Debian 8.
Yes, I spun up an Amazon V100 GPU instance with the Deep Learning AMI, which comes with CUDA 9.0. It all runs fine if you install tf-nightly-gpu with Python 2.7 and run it without changing anything about cuDNN, CUDA, or the AMI. tf-nightly-gpu installed as version 1.8.
A new error appeared when running the test: virtualenv/lib/python3.5/site-packages/tensorflow/contrib/tensorrt/_wrap_conversion.so: undefined symbol: _ZNK10tensorflow17StringPieceHasherclENS_11StringPieceE. Running on Ubuntu 16.04, Python 3.5, tensorflow-gpu 1.7, CUDA 9.0.176, cuDNN v7. Can anyone help?
I've demangled the symbol; it appears to be tensorflow::StringPieceHasher::operator()(tensorflow::StringPiece). Are we using some hash in the test?
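For readers wondering how that demangling works: Itanium-ABI mangled names encode each namespace/class component as a decimal length followed by that many characters, so `10tensorflow17StringPieceHasher` decodes to `tensorflow::StringPieceHasher`. Real demangling should use `c++filt` or `abi::__cxa_demangle`; this toy decoder (a sketch with names of my own choosing) only handles the length-prefixed part:

```python
import re

def decode_components(s):
    """Decode a run of length-prefixed identifiers, e.g.
    '10tensorflow17StringPieceHasher' -> ['tensorflow', 'StringPieceHasher']."""
    parts, i = [], 0
    while i < len(s):
        m = re.match(r"\d+", s[i:])
        if not m:
            break
        n = int(m.group())        # component length
        i += len(m.group())       # skip past the digits
        parts.append(s[i:i + n])  # take exactly n characters
        i += n
    return parts

# In _ZNK10tensorflow17StringPieceHasherclENS_11StringPieceE, the N...E pair
# wraps the nested name, K marks a const member function, and 'cl' is the
# encoding for operator().
print(decode_components("10tensorflow17StringPieceHasher"))
```

An undefined-symbol error like this usually means the extension .so was built against a different TensorFlow binary than the one installed, which is consistent with tf-nightly-gpu fixing it.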
Confirmed: on Ubuntu 16.04 / CUDA 9.0 / cuDNN 7102, tf-nightly-gpu solved this problem for me.
tf-nightly-gpu solved it! Found it after 6 hours of reinstalling everything multiple times and researching. Thanks a lot!
tf-nightly-gpu solved this problem for me too.
So I have a similar problem, in combination with other problems of similar obscurity. Nothing in this thread seems to work :( I am running on Windows 10 (unfortunately). I have tried all combinations of CUDA toolkit and cuDNN versions. I am using Anaconda and pip for most of my modules. Someone please help!
WARNING:tensorflow:From C:\Users\caleb\AppData\Local\Continuum\anaconda3\envs\gpu2\lib\site-packages\tensorflow\contrib\learn\python\learn\datasets\base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
(A lot more of the same warning for different items, ~100.)
WARNING:root:Variable [FeatureExtractor/MobilenetV1/Conv2d_9_pointwise/weights/RMSProp_1] is not available in checkpoint
I should also note that this maxes out utilization of both of my processors and all of my (very abundant) RAM.
@csindic This appears to be a different problem. I am going to close this issue, as the original problem seems to have been solved by fixing versioning across the libraries. @csindic, please resubmit and pay attention to the issue template (https://github.com/tensorflow/tensorflow/issues/new).
@weizh888 Hi, I had the same problem. Did you solve it, or can anyone help?
@Davidnet I met the same problem with TensorRT. Have you solved it?
System: Ubuntu 18.04
@MacwinWin, @jerryhouuu These appear to be different issues; please open new issues to keep the responses clear and distinct.
@jerryhouuu It was solved.
I had this same problem, and for me tf-nightly-gpu solved it! 💯
@weizh888
I downgraded
I have the same problem but could not solve it with your solution. The strange thing: I have cuDNN 7.3.1 installed, yet I still get the same version numbers as you did in the error message.
Cool, tf-nightly-gpu solved the problem!
@saskra I have the same problem. Did you solve it?
Yes, with this tutorial: https://github.com/pplcc/ubuntu-tensorflow-pytorch-setup
I had the same problem. It got resolved by converting cuDNN from 7.1.3-cuda8.0_0 to 7.0.5-cuda8.0_0 (bash: conda install cudnn=7.0).
I have the same error message here, but I am using cuDNN 7005 instead of 7500. I don't even have cuDNN 7500 installed in my Docker image. May I ask what I should do?
2019-06-18 08:17:41.727989: E tensorflow/stream_executor/cuda/cuda_dnn.cc:378] Loaded runtime CuDNN library: 7501 (compatibility version 7500) but source was compiled with 7004 (compatibility version 7000). If using a binary install, upgrade your CuDNN library to match. If building from sources, make sure the library loaded at runtime matches a compatible version specified during compile configuration.
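A note on reading these numbers: cuDNN encodes versions as major*1000 + minor*100 + patch, so 7501 is 7.5.1 and 7004 is 7.0.4. A small sketch (the log line is copied from the comment above; the helper name `pretty` is mine) makes the mismatch easier to read:

```python
import re

# Log line copied from the comment above.
LOG = ("E tensorflow/stream_executor/cuda/cuda_dnn.cc:378] Loaded runtime "
       "CuDNN library: 7501 (compatibility version 7500) but source was "
       "compiled with 7004 (compatibility version 7000).")

def pretty(v):
    """Render cuDNN's integer encoding as major.minor.patch (7501 -> 7.5.1)."""
    return f"{v // 1000}.{(v % 1000) // 100}.{v % 100}"

m = re.search(r"library: (\d+) .*compiled with (\d+)", LOG)
runtime, compiled = map(int, m.groups())
print(f"runtime cuDNN {pretty(runtime)} vs compiled-against {pretty(compiled)}")
# -> runtime cuDNN 7.5.1 vs compiled-against 7.0.4
```

So despite the "7005" the commenter believes is installed, the library actually being loaded at runtime is 7.5.1, which suggests another cuDNN copy earlier on the loader path.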
System information
python tensorrt.py --frozen_graph=resnetv2_imagenet_frozen_graph.pb --image_file=image.jpg --native --fp32 --fp16 --output_dir=output
Describe the problem
Fresh OS installation; trying to run the TensorRT example.
Source code / logs