CUDA 7.5 fails with pip install and docker (Ubuntu 14.04) #20

Closed
soumith opened this issue Nov 9, 2015 · 54 comments

@soumith

soumith commented Nov 9, 2015

Installing via:

# For GPU-enabled version (only install this version if you have the CUDA sdk installed)
$ pip install https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.5.0-cp27-none-linux_x86_64.whl

Tried to run the alexnet_benchmark.py and it's looking for CUDA 7.0 specifically.

I have CUDA 7.5 on my machine.

Full stack:

Traceback (most recent call last):
  File "alexnet_benchmark.py", line 21, in <module>
    import tensorflow.python.platform
  File "/home/awesomebox/anaconda/lib/python2.7/site-packages/tensorflow/__init__.py", line 4, in <module>
    from tensorflow.python import *
  File "/home/awesomebox/anaconda/lib/python2.7/site-packages/tensorflow/python/__init__.py", line 22, in <module>
    from tensorflow.python.client.client_lib import *
  File "/home/awesomebox/anaconda/lib/python2.7/site-packages/tensorflow/python/client/client_lib.py", line 35, in <module>
    from tensorflow.python.client.session import InteractiveSession
  File "/home/awesomebox/anaconda/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 11, in <module>
    from tensorflow.python import pywrap_tensorflow as tf_session
  File "/home/awesomebox/anaconda/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow.py", line 28, in <module>
    _pywrap_tensorflow = swig_import_helper()
  File "/home/awesomebox/anaconda/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow', fp, pathname, description)
ImportError: libcudart.so.7.0: cannot open shared object file: No such file or directory

Tried the docker install, but the docker image is configured for a particular NVIDIA driver version, and doesn't work with others. (this is a known issue: docker driver version and system driver version must exactly match)
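The ImportError in the traceback above is raised by the dynamic loader, not by Python code, so it can be reproduced without importing TensorFlow at all. A minimal sketch, assuming a Linux system and using only the standard `ctypes` module; the helper name `can_load` is hypothetical and not part of TensorFlow:

```python
import ctypes


def can_load(soname):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(soname)
    except OSError:
        # Raised when the loader cannot find or open the library.
        return False
    return True


if __name__ == "__main__":
    # The 0.5.0 GPU wheel asks for libcudart.so.7.0 by its exact versioned name.
    for name in ("libcudart.so.7.0", "libcudart.so.7.5"):
        print("%s: %s" % (name, "loadable" if can_load(name) else "not loadable"))
```

On a machine with only CUDA 7.5 installed, one would expect the first name to be reported as not loadable, which matches the error in the traceback.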

@soumith soumith changed the title CUDA 7.5 fails with pip install and docker CUDA 7.5 fails with pip install and docker (Ubuntu 14.04) Nov 9, 2015
@soumith
Author

soumith commented Nov 9, 2015

@nivwusquorum haha that's a terrible workaround, as it'll start issues with other libraries. Thanks a lot though. I'm installing CUDA 7.0

@lukesimo

lukesimo commented Nov 9, 2015

@nivwusquorum no.

I'm running into the same issue. Looks like I'll be downgrading then.

@ebrevdo
Contributor

ebrevdo commented Nov 9, 2015

Out of curiosity, have you set your LD_LIBRARY_PATH to your cuda installation's lib64 directory?

@lukesimo

lukesimo commented Nov 9, 2015

@ebrevdo yes

printenv LD_LIBRARY_PATH
/usr/local/cuda-7.5/lib64
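Setting LD_LIBRARY_PATH only helps if one of its directories actually contains the exact file name being requested. A rough sketch of that check follows; the function name `find_in_library_path` is hypothetical, and it inspects only LD_LIBRARY_PATH, not the loader cache or the system default directories:

```python
import os


def find_in_library_path(soname, library_path=None):
    """List every LD_LIBRARY_PATH entry that contains a file named soname."""
    if library_path is None:
        library_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for directory in library_path.split(os.pathsep):
        candidate = os.path.join(directory, soname)
        # Skip empty entries produced by leading, trailing, or doubled separators.
        if directory and os.path.exists(candidate):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    for name in ("libcudart.so.7.0", "libcudart.so.7.5"):
        print("%s: %s" % (name, find_in_library_path(name) or "not on LD_LIBRARY_PATH"))
```

A CUDA 7.5 lib64 directory would presumably hold libcudart.so.7.5 rather than libcudart.so.7.0, so a path like the one printed above can be set correctly and the import can still fail.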

@mdda

mdda commented Nov 9, 2015

On the subject of CUDA library versions ... CUDA 7.0 works for me (as expected), but it really insists on cuDNN 6.5 (which Nvidia now has as 'legacy').

Exact same library locations, etc, but downgrading from cuDNN 7.0 to 6.5 worked.

@graphific

Yes, it's within the TensorFlow code, so some simple Python hacking won't solve it :)
(It comes from _pywrap_tensorflow.so when you pip install the binary):

_mod = imp.load_module('_pywrap_tensorflow', fp, pathname, description)
ImportError: libcudart.so.7.0: cannot open shared object file: No such file or directory

@emergix

emergix commented Nov 15, 2015

I assume I have the same problem:
I have CUDA 7.5 installed with TensorFlow, and when I try the following (as in the GPU tutorial):
import tensorflow as tf

sess = tf.Session()
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
print(c)
sess.run(c)

it breaks!
(In Torch7, I have no problems with GPUs.)

@jimaldon

Can we assume tensorflow to be forward compatible with cuda 7.5?

@andorremus

I've got the same problem.

So are you saying that downgrading from cuda 7.5 to 7 should do the trick?

@andorremus

If it helps, I've installed cuda toolkit 7.0 and changed the bash profile reference to the new one and it works.

@emergix

emergix commented Nov 17, 2015

Yes, you do not need to reinstall the CUDA 7.0 driver; just add the path to libcudart.so.7.0 at the end of the LD_LIBRARY_PATH variable. After discussing it with someone on the team, I was told a strange story: apparently the old CUDA drivers are very much in demand among people using Amazon's AWS!
Can someone confirm?

@ebrevdo
Contributor

ebrevdo commented Nov 23, 2015

Long story short: tensorflow currently requires cuda 7.0. If you install version 7.0 in a separate directory from 7.5, and point tensorflow at it via the configure script (or LD_LIBRARY_PATH), it will work. Leaving this open to track future upgrades to the 7.5 SDK.
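The LD_LIBRARY_PATH route described above can be sketched from Python as well. The dynamic loader reads the variable when a process starts, so the sketch builds a modified environment and launches the script in a child process instead of changing `os.environ` in place. The install location `/usr/local/cuda-7.0/lib64` and both helper names are assumptions for illustration, not something this thread or TensorFlow prescribes:

```python
import os
import subprocess
import sys

# Assumed location of a side-by-side CUDA 7.0 install; adjust to the real path.
CUDA_70_LIB = "/usr/local/cuda-7.0/lib64"


def env_with_library_dir(lib_dir, base_env=None):
    """Return a copy of the environment with lib_dir first on LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = lib_dir + (os.pathsep + existing if existing else "")
    return env


def run_script(script_args, lib_dir=CUDA_70_LIB):
    """Run a Python script in a child process that sees the extra library dir."""
    command = [sys.executable] + list(script_args)
    return subprocess.call(command, env=env_with_library_dir(lib_dir))
```

For example, `run_script(["alexnet_benchmark.py"])` would start the benchmark with the CUDA 7.0 directory visible to the loader while leaving the rest of the session, and any CUDA 7.5 setup, untouched.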

@FabHan

FabHan commented Dec 8, 2015

I'm curious why it is hard to upgrade to CUDA 7.5 and cuDNN v3. Can anyone help me understand?

@pannous

pannous commented Dec 9, 2015

@emergix so you got it to work just by providing libcudart.so.7.0 without reinstalling / downgrading to old cuda?

@andorremus

Yes. It doesn't require you to uninstall the previous one. You can just install them separately and reference v7.0 in your bashrc file

@pannous

pannous commented Dec 9, 2015

Thanks! Workaround confirmed here.

@fivejjs

fivejjs commented Jan 4, 2016

Symbolically link libcuda*.7.5 to libcuda*.7.0.
It works.
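The symbolic-link workaround mentioned above could look roughly like the sketch below. It places the link in a separate directory rather than inside the CUDA install, so other software is not affected. The helper name `link_as` is hypothetical, and nothing here guarantees that a binary built against CUDA 7.0 behaves correctly when it is handed the 7.5 library under the old name:

```python
import os


def link_as(existing_lib, wanted_name, link_dir):
    """Create link_dir/wanted_name as a symlink pointing at existing_lib."""
    if not os.path.isdir(link_dir):
        os.makedirs(link_dir)
    link_path = os.path.join(link_dir, wanted_name)
    # Never clobber a real library or an earlier link.
    if os.path.lexists(link_path):
        raise OSError("refusing to overwrite " + link_path)
    os.symlink(os.path.abspath(existing_lib), link_path)
    return link_path
```

A call such as `link_as("/usr/local/cuda-7.5/lib64/libcudart.so.7.5", "libcudart.so.7.0", "/tmp/cuda-compat")` (both paths are assumptions) would create the link, after which the link directory still has to be added to LD_LIBRARY_PATH before the process starts.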

@esube

esube commented Jan 14, 2016

This issue and related ones have been open for more than two months now. Is there any progress on supporting CUDA 7.5 and cuDNN 7.0, other than the workarounds? This could be a turn-off for some people who have already been working with CUDA 7.5 for a while.

@kmhofmann
Contributor

I completely agree. Ideally (assuming reasonable API/ABI stability on NVIDIA's side), TensorFlow should not be dependent on specific older versions of CUDA and cuDNN. (I'd rather understand it if the latest version was required to make use of certain features.)
This is the one issue that puts me off using TensorFlow with GPU support.

@cmcneil

cmcneil commented Jan 16, 2016

+1 Please support CUDA 7.5

@ville-k
Contributor

ville-k commented Jan 16, 2016

If you want to try out CUDA 7.5 under Linux, you could try building my pull request branch:
#664
It adds support for CUDA on OSX and uses CUDA 7.5 when building under OSX. I haven't tried 7.5 under Linux, but it seems to work ok with OSX.
The only change you'd need to make is to edit the configure file and set CUDA_VERSION='7.5' when Linux is detected. Lines 48-49 would look like:

if [ "$OSNAME" == "Linux" ]; then
  CUDA_VERSION='7.5'
