CUDA 10 #22706

Closed
tfboyd opened this issue Oct 3, 2018 · 55 comments

@tfboyd
Member

tfboyd commented Oct 3, 2018

Edited: 17-DEC-2018 Nightly builds are CUDA 10
Edited: 27-NOV-2018 Added 1.12 FINAL builds.
Edited: 29-OCT-2018 Added 1.12 RC2 builds.
Edited: 17-OCT-2018 Added 1.12 RC1 builds.

Nightly Builds are now CUDA 10 as of 16-DEC-2018

  • [DONE] CUDA 10 in the nightly builds mid-DEC 2018 if testing goes well
  • CUDA 10 in the official release in mid-JAN-2019 as TF 1.13.

Install instructions for CUDA 10 for a fresh system that should also work on an existing system if using apt-get.

TensorFlow will be upgrading to CUDA 10 as soon as possible. As of TF 1.11, building TensorFlow from source works just fine with CUDA 10, and possibly even with earlier TF versions. Nothing special is needed other than the CUDA, cuDNN, NCCL (optional), and TensorRT (optional) libraries. If people have builds, feel free to link them here, along with any issues. (If you download them, keep in mind that you need to decide what risk you want to take based on the source.) Also, NCCL is now open source again and will soon be back to being automatically downloaded by bazel and included in the binary.

CUDA 10 would likely go into TF 1.13, which is not scheduled. I will update this post as I have more info to share. I hope to flip nightly builds to CUDA 10 in November but the TF 1.13 release will likely push to early Jan.

I and some really cool people below made some binaries (even windows, see comments below) to help people out along with my really bad "instructions". I suspect the instructions linked in the comments below are better.

Libraries used (rough list, similar to what I listed above)

  • Ubuntu 16.04 LTS on GCE from base Google Cloud Ubuntu image.
  • Python 2.7 because that is how I roll and someday I will change.
  • Install 410+ driver using apt-get
  • CUDA 10.0.130 from .tgz install (do not install the driver)
  • cuDNN-10.0 v7.3.0.29 from tgz install
  • nccl 2.3.4 for CUDA 10 (I am not sure it matters I have mixed them up before)
  • TensorRT-5.0.0 for CUDA 10 cudnn 7.3 (TensorRT 5 was not stated as supported by Tensorflow when
  • Compute Capability: 3.7,5.2,6.0,6.1,7.0
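Before installing a wheel against a setup like the one listed above, it can help to ask the dynamic loader which of these libraries it can actually find. This is only a sketch for Linux: `find_cuda_libs` is a hypothetical helper, the base names `cudart`, `cudnn`, `nccl`, and `nvinfer` (TensorRT's core library) are assumptions about a typical install, and a `None` result only means the library is not on the default search path.

```python
from ctypes.util import find_library

# Base names of the shared libraries (assumed; adjust for your install).
CUDA_LIBS = ("cudart", "cudnn", "nccl", "nvinfer")


def find_cuda_libs(names=CUDA_LIBS):
    """Map each library base name to what the loader resolves, or None."""
    return {name: find_library(name) for name in names}


for name, resolved in find_cuda_libs().items():
    print("{:8s} {}".format(name, resolved or "NOT FOUND"))
```

A missing entry does not prove the library is absent; it may simply live in a directory that is only added to the search path at runtime.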

TF 1.12.0 FINAL

Python 2.7 (Ubuntu 16.04)

Python 3.5 (Ubuntu 16.04)

Dockerfiles
These are partial files and the apt-get commands should work on non-Docker systems as well assuming you have the NVIDIA apt-get repositories which should be the same as listed on tf.org.

  • Devel Confirmed by building TensorFlow at 1.12RC0.
  • Runtime
@alanpurple
Contributor

alanpurple commented Oct 4, 2018

I've built TF 1.12.0rc0 with all the latest libraries. It tests well except for this silly warning:

*** WARNING *** You are using ptxas 10.0.145, which is older than 9.2.88. ptxas 9.x before 9.2.88 is known to miscompile XLA code, leading to incorrect results or invalid-address errors.

You do not need to update to CUDA 9.2.88; cherry-picking the ptxas binary is sufficient.

seems no problem for now

only thing left is python 3.7, which I haven't tried yet

@alanpurple
Contributor

Could we expect CUDA10+python3.7 prebuilt images from 1.13rc0?

@tfboyd
Member Author

tfboyd commented Oct 4, 2018

@alanpurple Ahh, I had not tried CUDA 10 with XLA yet; it is now compiled in by default. Great feedback. I will pass the warning on to the team (b/117268529); it looks like they might have a slight error in their version checker. As far as Python versions go, that is not under my authority (even a little). @gunan owns that part of the matrix.

tensorflow-copybara pushed a commit that referenced this issue Oct 5, 2018
This was completely broken for CUDA versions > 9 and resulted in spurious warnings.

Reported in #22706#issuecomment-426861394 -- thank you!

PiperOrigin-RevId: 215841354
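
The spurious "10.0.145 is older than 9.2.88" warning is the kind of result a version check gives when versions are not compared component by component. The actual TensorFlow fix lives in the referenced commit and is not reproduced here; the following is only a sketch of a tuple-based comparison, with `parse_version` and `is_older` as hypothetical helper names.

```python
def parse_version(text):
    """Turn a dotted version such as '10.0.145' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_older(found, minimum):
    """True if `found` is strictly older than `minimum`.

    Tuples compare element by element, so the major version dominates.
    """
    return parse_version(found) < parse_version(minimum)


print(is_older("10.0.145", "9.2.88"))  # False: 10.x is newer than 9.2.88
print(is_older("9.1.85", "9.2.88"))    # True: a 9.x release before 9.2.88
```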
@tfboyd tfboyd closed this as completed Oct 5, 2018
@tfboyd tfboyd reopened this Oct 5, 2018
@tejavoo

tejavoo commented Oct 6, 2018

@tfboyd Hey could you kindly share the .whl of tensorflow 1.11 with,

CUDA 10.0.130_410.48
cuDNN 7.3.0.29 for CUDA 10
I was not able to build it myself. I have an RTX 2080 with Ubuntu 16.04. Thanks in advance.

@tfboyd
Member Author

tfboyd commented Oct 8, 2018

Updated 10-15-2018: Moved my builds into the original comment to make the thread shorter.

Updated 10-10-2018: Build is under way. For python 2.7 and compute 3.7,5.2,6.0,6.1,7.0 (normally only do 6.0 and 7.0 so I tried to think about what all of you might want)

benjamintanweihao pushed a commit to benjamintanweihao/tensorflow that referenced this issue Oct 12, 2018
This was completely broken for CUDA versions > 9 and resulted in spurious warnings.

Reported in tensorflow#22706#issuecomment-426861394 -- thank you!

PiperOrigin-RevId: 215841354
@homieseven

I just wrote instructions on how to build TF from scratch. Maybe somebody will find them useful.

@arunmandal53

arunmandal53 commented Oct 13, 2018

Here is my new tutorial for building Tensorflow 1.12 + CUDA 10.0 + CUDNN 7.3.1 + NCCL 2.3.5 + bazel-0.17.2: https://www.python36.com/how-to-install-tensorflow-gpu-with-cuda-10-0-for-python-on-ubuntu/ You will also find a prebuilt wheel at the end of it.

@ivan-marroquin

Thanks @arunmandal53 for taking the time to write the tutorial. However, I need to install Tensorflow in the Python installation that is under my home account rather than installing Tensorflow on Python that comes with Ubuntu. I think that I need to wait until Tensorflow releases a version that can be installed using pip and is compatible with CUDA 10 (and cuDNN)

@arunmandal53

arunmandal53 commented Oct 14, 2018

Tensorflow 1.12 whl for CUDA 10.0 + CUDNN 7.3.1 + NCCL 2.3.5 + python 3.6 +Ubuntu

@tfboyd
Member Author

tfboyd commented Oct 14, 2018

@ivan-marroquin In case you did not know, you can install any .whl file anywhere you want. You can use the .whl packages I created or the ones @arunmandal53 has made, which cover Python 3.6. Pretty nice selection :-). I might also not understand your issue, but you can do this:

pip install http://path to whatever you want
or
pip install /local/path/to/whl

You do not have to rely on PyPI. Using --user would also work, as would installing in a virtual env.
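
Before installing a wheel from a URL or a local path, it can help to check that the tags in its filename match the interpreter you plan to use. The sketch below assumes the standard wheel naming scheme (name-version-pythontag-abitag-platformtag.whl, without the optional build tag). `wheel_tags` and `matches_interpreter` are hypothetical helpers, and pip performs a more complete check itself.

```python
import sys


def wheel_tags(filename):
    """Split a wheel filename into its name, version and tag fields."""
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    name, version, python_tag, abi_tag, platform_tag = stem.split("-")
    return {
        "name": name,
        "version": version,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def matches_interpreter(filename, version_info=sys.version_info):
    """True if a CPython tag such as 'cp36' matches the given interpreter."""
    expected = "cp{}{}".format(version_info[0], version_info[1])
    return wheel_tags(filename)["python"] == expected


tags = wheel_tags("tensorflow-1.12.0-cp36-cp36m-linux_x86_64.whl")
print(tags["python"], tags["abi"], tags["platform"])
# cp36 cp36m linux_x86_64
```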

@ivan-marroquin

@tfboyd thanks for the clarification. And thanks to all of you for such impressive help! I will give it a try as soon as possible.

@tejavoo

tejavoo commented Oct 14, 2018

@everyone thanks for spilling all the bits and bytes. I have built TF with all my requirements and have been running the tests successfully for 5 days now :)

@arunmandal53

Tensorflow 1.12 whl for CUDA 10.0 + CUDNN 7.3.1 + python 3.6 + Windows

@arunmandal53

arunmandal53 commented Oct 16, 2018

Here is my new tutorial on building Tensorflow 1.12 + CUDA 10.0 + CUDNN 7.3.1 + Bazel 0.17.2 on Windows: https://www.python36.com/how-to-install-tensorflow-gpu-with-cuda-10-0-for-python-on-windows/

@jim-parsons

jim-parsons commented Oct 17, 2018

Tensorflow 1.12 whl for CUDA 10.0 + CUDNN 7.3.1 + NCCL 2.3.5 + python 3.6 +Ubuntu

@arunmandal53
do you have tensorflow whl for windows 10?
thanks

When I install CUDA 10 and tensorflow-gpu 1.11.0 with an RTX 2080, I get the error "ImportError: DLL load failed:".
I tried CUDA 9.2 with an unofficial tensorflow-gpu whl and it succeeded!
But it fails with CUDA 10, so I wonder if anyone has a compiled tensorflow-gpu whl?
Thanks

cuda 10.0
cudnn 7.3.1
tensorflow-gpu 1.11.0
rtx 2080

@arunmandal53

arunmandal53 commented Oct 17, 2018

You will find whl links above.

@tejavoo

tejavoo commented Oct 29, 2018

@LouSparfell

LouSparfell commented Nov 4, 2018

Ubuntu 18.04 - CUDA 10.0 - libcudnn 7.3.1 + Python 3.6
I have tried the various wheel packages but it coredumps. :(

I'm trying bazel now, but it's pain in the ass... just to have the last version of cuda. :(

EDIT:
I have built it nicely by following these steps 👍
https://www.python36.com/how-to-install-tensorflow-gpu-with-cuda-10-0-for-python-on-ubuntu/

There was a problem with bazel, and to build TF I needed to add the --batch option:
bazel --batch build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

@MingyaoLiu

MingyaoLiu commented Nov 5, 2018

Here is a Windows build of TensorFlow 1.12.0 with CUDA 10 and cuDNN 7.3.1 for an RTX 2080 Ti (compute capability 7.5).
tensorflow-1.12.0-cp36-cp36m-win_amd64.whl
I'm not sure what compute capability the build in the comment above targets, so I built a new one.

@lahwran
Contributor

lahwran commented Nov 6, 2018

are nightlies cuda 10 at the moment?

edit: nope.

@MingyaoLiu

Here is an Ubuntu 18.10 build (kernel 4.18), TensorFlow r1.12 branch, Python 3.6, CUDA 10.0, cuDNN 7.4.1.5, NCCL 2.3.7, TensorRT 5.0, compute capability 7.5 and 6.1, for my RTX 2080 Ti.
tensorflow-1.12.0-cp36-cp36m-linux_x86_64.whl

@MingyaoLiu Were you able to install CUDA 10 on Ubuntu 18.10 ? Did you use 18.04 deb package?

Yes. If I remember correctly, I installed it manually with the CUDA 10 run file for 18.04. The deb package may not work.

@alifare

alifare commented Dec 7, 2018

Here is an Ubuntu 18.10 build (kernel 4.18), TensorFlow r1.12 branch, Python 3.6, CUDA 10.0, cuDNN 7.4.1.5, NCCL 2.3.7, TensorRT 5.0, compute capability 7.5 and 6.1, for my RTX 2080 Ti.
tensorflow-1.12.0-cp36-cp36m-linux_x86_64.whl

@MingyaoLiu Were you able to install CUDA 10 on Ubuntu 18.10 ? Did you use 18.04 deb package?

Yes. If i remember correctly i manually installed with cuda 10 for 18.04 run file. Deb may not work?

Hi @MingyaoLiu ,
Can you share the .whl file you compiled?
My GPU's compute capability is 3.5, but the .whl file provided by @arunmandal53 requires 5.0, so it doesn't work on my computer.
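
Whether a prebuilt wheel will use a given GPU depends on the compute capabilities it was compiled for. As a rough sketch, assuming a build accepts any device at or above the lowest capability it was compiled for (newer devices relying on PTX forward compatibility, which is an assumption here and not something confirmed in this thread), the check looks like this. `parse_capability` and `build_supports` are hypothetical helpers.

```python
def parse_capability(text):
    """Turn '6.1' into (6, 1) so capabilities compare numerically."""
    major, minor = text.split(".")
    return int(major), int(minor)


def build_supports(device_capability, build_capabilities):
    """True if the device is at or above the lowest compiled capability."""
    lowest = min(parse_capability(c) for c in build_capabilities)
    return parse_capability(device_capability) >= lowest


build = ["3.7", "5.2", "6.0", "6.1", "7.0"]  # list from the original post
print(build_supports("3.5", build))  # False: 3.5 is below 3.7
print(build_supports("7.5", build))  # True under the PTX assumption
```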

@Garfield2013

Tensorflow 1.12 whl for CUDA 10.0 + CUDNN 7.3.1 + python 3.6 + Windows

What compute capability is it built for?
Can you please share how you made the file? When I tried, I did not succeed :(

@Garfield2013

Here is a windows build tensorflow 1.12.0, CUDA 10, cuDNN 7.3.1, RTX2080TI (feature level 7.5).
tensorflow-1.12.0-cp36-cp36m-win_amd64.whl
one above in the comment I'm not sure what the CUDA feature level is, so I build a new one.

Can you please share how you make the file? When I tried I did not succeed :(
I need the file with compute level 7...

@Garfield2013

When I try to build TF on an AWS P3 (Tesla V100) instance, it gets to 4000/6000 and then fails. What can I do to fix it?

.\tensorflow/core/kernels/mirror_pad_op_cpu_impl.h(31): note: see reference to class template instantiation 'tensorflow::functor::MirrorPadtensorflow::CpuDevice,tensorflow::int64,tensorflow::int64,5' being compiled
ERROR: C:/tensorflow/tensorflow/core/kernels/BUILD:3568:1: undeclared inclusion(s) in rule '//tensorflow/core/kernels:softsign_op_gpu':
this rule is missing dependency declarations for the following files included by 'tensorflow/core/kernels/softsign_op_gpu.cu.cc':
'C:/users/administrator/appdata/local/temp/nvcc_inter_files_tmp_dir/softsign_op_gpu.cu.cudafe1.stub.c'
'C:/users/administrator/appdata/local/temp/nvcc_inter_files_tmp_dir/softsign_op_gpu.cu.fatbin.c'
c:\users\administrator_bazel_administrator\xv6zejqw\execroot\org_tensorflow\external\eigen_archive\eigen\src/Core/util/Memory.h(164): warning: calling a host function from a host device function is not allowed

later it ends with...

c:\users\administrator_bazel_administrator\xv6zejqw\execroot\org_tensorflow\external\eigen_archive\unsupported\eigen\src/SpecialFunctions/SpecialFunctionsImpl.h(712): warning: missing return statement at end of non-void function "Eigen::internal::igamma_series_impl<Scalar, mode>::run [with Scalar=double, mode=Eigen::internal::SAMPLE_DERIVATIVE]"
detected during:
instantiation of "Scalar Eigen::internal::igamma_series_impl<Scalar, mode>::run(Scalar, Scalar) [with Scalar=double, mode=Eigen::internal::SAMPLE_DERIVATIVE]"
(863): here
instantiation of "Scalar Eigen::internal::igamma_generic_impl<Scalar, mode>::run(Scalar, Scalar) [with Scalar=double, mode=Eigen::internal::SAMPLE_DERIVATIVE]"
(2108): here
instantiation of "Eigen::internal::gamma_sample_der_alpha_retval<Eigen::internal::global_math_functions_filtering_base<Scalar, void>::type>::type Eigen::numext::gamma_sample_der_alpha(const Scalar &, const Scalar &) [with Scalar=double]"
c:\users\administrator_bazel_administrator\xv6zejqw\execroot\org_tensorflow\external\eigen_archive\unsupported\eigen\src/SpecialFunctions/arch/CUDA/CudaSpecialFunctions.h(154): here

c:\users\administrator_bazel_administrator\xv6zejqw\execroot\org_tensorflow\external\eigen_archive\eigen\src/Core/ArrayWrapper.h(94): warning: __declspec attributes ignored

external/com_google_absl\absl/strings/string_view.h(496): warning: expression has no effect

external/protobuf_archive/src\google/protobuf/arena_impl.h(55): warning: integer conversion resulted in a change of sign

external/protobuf_archive/src\google/protobuf/arena_impl.h(309): warning: integer conversion resulted in a change of sign

external/protobuf_archive/src\google/protobuf/arena_impl.h(310): warning: integer conversion resulted in a change of sign

external/protobuf_archive/src\google/protobuf/map.h(1025): warning: invalid friend declaration

host_defines.h is an internal header file and must not be used directly. This file will be removed in a future CUDA release. Please use cuda_runtime_api.h or cuda_runtime.h instead.
Target //tensorflow/tools/pip_package:build_pip_package failed to build
INFO: Elapsed time: 1100.301s, Critical Path: 135.78s
INFO: 2335 processes: 2335 local.
FAILED: Build did NOT complete successfully
PS C:\tensorflow>

@alifare

alifare commented Dec 17, 2018

Edited: 27-NOV-2018 Added 1.12 FINAL builds.
Edited: 29-OCT-2018 Added 1.12 RC2 builds.
Edited: 17-OCT-2018 Added 1.12 RC1 builds.

  • CUDA 10 in the nightly builds mid-DEC 2018 if testing goes well
  • CUDA 10 in the official release in mid-JAN-2019 due to the holiday break and code freezes. I would guess TF 1.13 (1.12 is out already)

TensorFlow will be upgrading to CUDA 10 as soon as possible. As of TF 1.11, building TensorFlow from source works just fine with CUDA 10, and possibly even with earlier TF versions. Nothing special is needed other than the CUDA, cuDNN, NCCL (optional), and TensorRT (optional) libraries. If people have builds, feel free to link them here, along with any issues. (If you download them, keep in mind that you need to decide what risk you want to take based on the source.) Also, NCCL is now open source again and will soon be back to being automatically downloaded by bazel and included in the binary.

CUDA 10 would likely go into TF 1.13, which is not scheduled. I will update this post as I have more info to share. I hope to flip nightly builds to CUDA 10 in November but the TF 1.13 release will likely push to early Jan.

I and some really cool people below made some binaries (even windows, see comments below) to help people out along with my really bad "instructions". I suspect the instructions linked in the comments below are better.

Libraries used (rough list, similar to what I listed above)

  • Ubuntu 16.04 LTS on GCE from base Google Cloud Ubuntu image.
  • Python 2.7 because that is how I roll and someday I will change.
  • Install 410+ driver using apt-get
  • CUDA 10.0.130 from .tgz install (do not install the driver)
  • cuDNN-10.0 v7.3.0.29 from tgz install
  • nccl 2.3.4 for CUDA 10 (I am not sure it matters I have mixed them up before)
  • TensorRT-5.0.0 for CUDA 10 cudnn 7.3 (TensorRT 5 was not stated as supported by Tensorflow when
  • Compute Capability: 3.7,5.2,6.0,6.1,7.0

TF 1.12.0 FINAL

Python 2.7 (Ubuntu 16.04)

Python 3.5 (Ubuntu 16.04)

TF 1.12.0 RC2

Python 2.7 (Ubuntu 16.04)

Python 3.5 (Ubuntu 16.04)

TF 1.12.0 RC1

Python 2.7 (Ubuntu 16.04)

Python 3.5 (Ubuntu 16.04)

TF 1.12.0 RC0

Python 2.7 (Ubuntu 16.04)

Python 3.5 (Ubuntu16.04)

Dockerfiles
These are partial files and the apt-get commands should work on non-Docker systems as well assuming you have the NVIDIA apt-get repositories which should be the same as listed on tf.org.

  • Devel Confirmed by building TensorFlow at 1.12RC0.
  • Runtime

Hi Toby, how's your testing going?
Is CUDA 10 supported by nightly builds now?

@byronyi
Contributor

byronyi commented Dec 17, 2018

Should be fixed (partially at least) by 29b8d49.

@tfboyd
Member Author

tfboyd commented Dec 17, 2018

@alifare

Nightly builds are now CUDA 10. I tested via the Docker images from dockerhub. I did not test Windows as that is something I have yet to use (sorry) for ML. This happened on Friday so there may still be some issues as we roll it out. Let me know if you run into any problems. We also moved to NCCL from source so you no longer need to install NCCL to use it.

More info in the next few weeks. I do not know if we will do an official RC before the year ends, but 1.13 is likely to go final mid/end-JAN and it 100% has CUDA 10 and cuDNN 7.4 (you could always update this by just adding a new binary).

@alifare

alifare commented Dec 18, 2018

@alifare

Nightly builds are now CUDA 10. I tested via the Docker images from dockerhub. I did not test Windows as that is something I have yet to use (sorry) for ML. This happened on Friday so there may still be some issues as we roll it out. Let me know if you run into any problems. We also moved to NCCL from source so you no longer need to install NCCL to use it.

More info in the next few weeks. I do not know if we will do an official RC before the year ends, but 1.13 is likely to go final mid/end-JAN and it 100% has CUDA 10 and cuDNN 7.4 (you could always update this by just adding a new binary).

Thanks a lot @tfboyd
I installed the nightly build yesterday and I am testing it with sample code now. Everything seems to be going well so far. I will let you know if I find any problems in the future.

@WOLVIE97

Well, if this route is good, how do I incorporate it into my system?

@clhne

clhne commented Dec 25, 2018

@tfboyd Hey could you kindly share the .whl of tensorflow 1.11 with,

CUDA 10.0.130_410.48
cuDNN 7.3.0.29 for CUDA 10
I was not able to build the same. I have a rtx 2080 with ubuntu 16.04. Thanks in advance

Me too.
Is there any info?

@clhne

clhne commented Dec 25, 2018

@rafaellopezgarcia

@alifare
Nightly builds are now CUDA 10. I tested via the Docker images from dockerhub. I did not test Windows as that is something I have yet to use (sorry) for ML. This happened on Friday so there may still be some issues as we roll it out. Let me know if you run into any problems. We also moved to NCCL from source so you no longer need to install NCCL to use it.
More info in the next few weeks. I do not know if we will do an official RC before the year ends, but 1.13 is likely to go final mid/end-JAN and it 100% has CUDA 10 and cuDNN 7.4 (you could always update this by just adding a new binary).

Thanks a lot @tfboyd
I installed the nightly build yesterday and I am testing it with sample code now. Everything seems to be going well so far. I will let you know if I find any problems in the future.

Thank you guys, it also works for me with a 1080 Ti.

@NoelKennedy

NoelKennedy commented Jan 7, 2019

I installed the nightly Docker image, which works in general, but I get an error message when trying to fit a keras.layers.Conv1D layer on my GPU (RTX 2070):

UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.

Is cuDNN installed on the tensorflow nightly containers? Or is this the wrong question!

@bhagathkumar

bhagathkumar commented Jan 8, 2019

my GPU : RTX 2070
Ubuntu 16.04
Python 3.5.2
Nvidia Driver 410.78
CUDA - 10.0.130
cuDNN-10.0 - 7.4.2.24
TensorRT-5.0.0
Compute Capability: 7.5

tensorflow-1.13.0rc0-cp35-cp35m-linux_x86_64

@Alnlll

Alnlll commented Jan 11, 2019

I tried some wheels from #22706 (comment),
and I keep getting this error when using TensorRT INT8 inference:

2019-01-11 01:54:24.555172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-11 01:54:25.902674: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-11 01:54:25.902717: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-01-11 01:54:25.902726: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-01-11 01:54:25.924944: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3035 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019-01-11 01:54:30.056922: I tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:502] import/InceptionResnetV1/my_trt_op_0 Constructing a new engine with batch size 32

=================================================================
2019-01-11 01:54:30.361891: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger ../builder/cudnnBuilder2.cpp (1508) - Misc Error in buildEngine: -1 (Could not find tensor (Unnamed ITensor* 3) in tensorScales.)
2019-01-11 01:54:30.424656: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger ../builder/cudnnBuilder2.cpp (1508) - Misc Error in buildEngine: -1 (Could not find tensor (Unnamed ITensor* 3) in tensorScales.)
2019-01-11 01:54:30.425290: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:516] Engine creation for batch size 32 failed Internal: Failed to build TensorRT engine
2019-01-11 01:54:30.425308: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:287] Engine retrieval for batch size 1 failed. 
=================================================================

I also built a wheel based on the NVIDIA TensorFlow Docker image 18.12 (py27) and got the same error.

Has anyone run into this error and solved it?

@redna11

redna11 commented Jan 12, 2019

my GPU : RTX 2070
Ubuntu 16.04
Python 3.5.2
Nvidia Driver 410.78
CUDA - 10.0.130
cuDNN-10.0 - 7.4.2.24
TensorRT-5.0.0
Compute Capability: 7.5

tensorflow-1.13.0rc0-cp35-cp35m-linux_x86_64

Does it include nccl 2.3.7? I have installed nccl but TF cannot find it.

@pooyadavoodi

pooyadavoodi commented Jan 13, 2019

Tried some wheels in #22706 (comment),
I keep getting such error when using tensorRT int8 inference:

2019-01-11 01:54:24.555172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-11 01:54:25.902674: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-11 01:54:25.902717: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-01-11 01:54:25.902726: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-01-11 01:54:25.924944: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3035 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019-01-11 01:54:30.056922: I tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:502] import/InceptionResnetV1/my_trt_op_0 Constructing a new engine with batch size 32

=================================================================
2019-01-11 01:54:30.361891: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger ../builder/cudnnBuilder2.cpp (1508) - Misc Error in buildEngine: -1 (Could not find tensor (Unnamed ITensor* 3) in tensorScales.)
2019-01-11 01:54:30.424656: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger ../builder/cudnnBuilder2.cpp (1508) - Misc Error in buildEngine: -1 (Could not find tensor (Unnamed ITensor* 3) in tensorScales.)
2019-01-11 01:54:30.425290: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:516] Engine creation for batch size 32 failed Internal: Failed to build TensorRT engine
2019-01-11 01:54:30.425308: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:287] Engine retrieval for batch size 1 failed. 
=================================================================

Also I built a wheel based on nvidia tensorflow docker 18.12(py27), the same error.

Does anyone run into this error and solve it?

Perhaps you haven't run calibration which is required for INT8 unless you manually insert dynamic ranges. https://docs.nvidia.com/deeplearning/dgx/integrate-tf-trt/index.html#tutorial-tftrt-int8

Also, I think you are using an old TF (perhaps 1.12?) because we have renamed my_trt_op to TRTEngineOp.

@Alnlll

Alnlll commented Jan 14, 2019

@pooyadavoodi Thanks for responding!!

  1. It's a calibrated graph that runs correctly in the NVIDIA TensorFlow Docker image 18.12. I got this error when trying to run it on TensorFlow built from source, as mentioned.
  2. Yes, I'm using TensorFlow r1.12 just to match the Docker image. I will give r1.13 a try.
  3. I noticed that some NCCL-related files are missing from all the TensorFlow builds I tried, compared to the NVIDIA Docker version:
/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nccl_ops.py
/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nccl_ops.py
...

@michaelmyc

Took me 15 days to follow up, sorry. This is the timeline I am seeing right now:

  • CUDA 10 in the nightly builds mid-DEC if testing goes well
  • CUDA 10 in the official release in mid-JAN due to the holiday break. I would guess TF 1.13 (1.12 is out already)

I am adding this to the original post above as well.

Are there any updates on 1.13 release date?

@Lxnus

Lxnus commented Jan 24, 2019

Hey,
maybe take a look at my CUDA 10 build for Tensorflow. It works fine. I have an RTX 2080 and 64 GB RAM, and I have had no problems working with this build.

https://github.com/Lxnus/tensorflow_r1.12_cuda_10

@NoelKennedy

Hi all,

I've managed to get this working and haven't experienced any issues for a while.

TLDR: Install CUDA 10, build Tensorflow locally. Once this is running you need to tell TF to use float16 and adjust the epsilon or you will get NaN during training.

import keras.backend as K

dtype='float16'
K.set_floatx(dtype)

# default is 1e-7 which is too small for float16.  Without adjusting the epsilon, we will get NaN predictions because of divide by zero
K.set_epsilon(1e-4) 
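
To see why the default epsilon can be a problem in half precision, the following sketch uses only the standard library (the `struct` format `e` is IEEE 754 half precision). It illustrates one plausible failure mode, an overflow when dividing by a tiny epsilon, and is not a trace of what Keras does internally. `to_half`, `reciprocal_fits`, and `HALF_MAX` are names introduced here.

```python
import struct

HALF_MAX = 65504.0  # largest finite float16 value


def to_half(x):
    """Round a Python float to the nearest float16 and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


def reciprocal_fits(eps):
    """True if 1/eps, with eps stored as float16, is still finite in float16."""
    return 1.0 / to_half(eps) <= HALF_MAX


print(reciprocal_fits(1e-7))  # False: 1/eps overflows float16
print(reciprocal_fits(1e-4))  # True: 1/eps is roughly 1e4
```

Once a value overflows to infinity, later operations such as inf - inf or inf * 0 produce NaN, which is consistent with the behaviour described above.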

@zdx198811

I've managed to build it in Ubuntu 18.04, with Python 3.6, Cuda 10.0, cudnn 7.4.2. GPU model RTX 2080Ti. Here is the .whl:
tensorflow-1.12.0-cp36-cp36m-linux_x86_64.whl
I used -march=native optimization flag, and my CPU is Xeon X5680 that supports SSE4.2 instruction set extensions. Not sure whether this leads to some incompatibility issues with other CPU models. Good luck.
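
Because a wheel built with -march=native may use instructions that another CPU lacks, one can compare CPU feature flags before trying it. This sketch assumes the Linux /proc/cpuinfo format, where SSE4.2 is reported as `sse4_2`. `cpu_flags` and `missing_flags` are hypothetical helpers, and the sample text is made up for illustration.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def missing_flags(required, cpuinfo_text):
    """Flags in `required` that the CPU does not report."""
    return sorted(set(required) - cpu_flags(cpuinfo_text))


# Made-up sample; on Linux, read the real text from /proc/cpuinfo instead.
sample = "processor\t: 0\nflags\t\t: fpu sse sse2 ssse3 sse4_1 sse4_2\n"
print(missing_flags(["sse4_2", "avx"], sample))  # ['avx']
```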

@KtK99

KtK99 commented Feb 6, 2019

Hi @tfboyd , I have the same question as @tydlwav , any updates on the 1.13 release? I am coding in R using Keras and using the RTX 2080. Really looking forward to this release.

@gunan
Contributor

gunan commented Feb 6, 2019

rc0 is already out:
https://github.com/tensorflow/tensorflow/releases/tag/v1.13.0-rc0

All our nightlies have also had CUDA 10 support for a while now.

@Kamlesh364

I've built tf 1.12.0rc0 with all the latest, test well except this silly warning

*** WARNING *** You are using ptxas 10.0.145, which is older than 9.2.88. ptxas 9.x before 9.2.88 is known to miscompile XLA code, leading to incorrect results or invalid-address errors.

You do not need to update to CUDA 9.2.88; cherry-picking the ptxas binary is sufficient.

seems no problem for now

only thing left is python 3.7, which I haven't tried yet

*** WARNING *** You are using ptxas 10.1.243, which is older than 11.1. ptxas before 11.1 is known to miscompile XLA code, leading to incorrect results or invalid-address errors.

I am facing this issue with Tensorflow==2.7.0 on Ubuntu 20.04. Please let me know how you solved your issue.
