If you are running into problems with TensorFlow #173

Open
chogovadze opened this issue Oct 27, 2020 · 6 comments
@chogovadze
chogovadze commented Oct 27, 2020

Hello everyone,
It seems that several users are reporting the same kind of problems with training and predicting.
After some research, this appears to be a compatibility issue between old TensorFlow 1.x versions and newer GPUs when TensorFlow is installed through pip. Compiling TensorFlow from source resolves the issue, but it is very time-consuming. I hope this write-up helps other users who are having trouble with their environment.

This method requires the use of conda.

  1. Create a new conda environment and run: conda install tensorflow-gpu=1.12 (conda will automatically pull in the matching cuda/cudnn versions).
  2. Once installation is complete, remove the tensorflow-gpu==1.12 line from requirement.txt and run the makefile.
  3. Change every batch_size and eval_batch_size in the config files to 1.
  4. Finally, run export TF_FORCE_GPU_ALLOW_GROWTH=true followed by export TMPDIR=/tmp/ in your current terminal session.
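
Steps 3 and 4 can also be scripted. The following is only a rough sketch: it assumes the config files use YAML-style "key: value" lines, and the helper names force_batch_size_one and gpu_friendly_env are made up for illustration and are not part of the repository.

```python
import os
import re

# Assumption: batch sizes appear as "batch_size: <int>" or
# "eval_batch_size: <int>" at the start of a (possibly indented) line.
_BATCH_KEY = re.compile(
    r"^([ \t]*(?:eval_)?batch_size[ \t]*:[ \t]*)\d+", re.MULTILINE
)


def force_batch_size_one(config_text: str) -> str:
    """Return config_text with every batch_size / eval_batch_size set to 1."""
    return _BATCH_KEY.sub(lambda m: m.group(1) + "1", config_text)


def gpu_friendly_env(base=None) -> dict:
    """Copy an environment mapping and add the two variables from step 4."""
    env = dict(os.environ if base is None else base)
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    env["TMPDIR"] = "/tmp/"
    return env
```

To use it, read each config file, pass its text through force_batch_size_one, and write the result back. The dictionary from gpu_friendly_env can be passed as the env argument of subprocess.run when launching a training script, which has the same effect as the two export commands for that one process.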

If you are still having issues, make sure that you have NOT:

  • Used an old conda environment with cuda/cudnn already configured.
  • Installed cuda/cudnn separately with the command conda install cudnn=x.x.x=cudax.x_x.
  • Run the makefile within the new conda environment before the steps above, which installs TensorFlow through pip.
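
One way to guard against the last point is to strip the TensorFlow pin from the requirements text before running the makefile. This is a minimal sketch, not the repository's own tooling: it assumes a plain pip requirements file with one requirement per line, and drop_tensorflow_requirements is a hypothetical helper name.

```python
import re

# Matches "tensorflow" or "tensorflow-gpu", optionally followed by a version
# specifier such as "==1.12". Other packages (e.g. tensorflow-probability)
# are deliberately left alone.
_TF_PIN = re.compile(r"^\s*tensorflow(-gpu)?\s*([=<>!~].*)?$", re.IGNORECASE)


def drop_tensorflow_requirements(requirements_text: str) -> str:
    """Return the requirements text without tensorflow / tensorflow-gpu lines."""
    kept = [
        line
        for line in requirements_text.splitlines()
        if not _TF_PIN.match(line)
    ]
    trailing_newline = "\n" if requirements_text.endswith("\n") else ""
    return "\n".join(kept) + trailing_newline
```

With the pin removed, pip has no reason to replace the conda-installed TensorFlow when the makefile installs the remaining requirements.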

I have successfully worked with this repository with the following setup:

  • Ubuntu 18.04
  • Ryzen 3700
  • RTX 2070 Super (8 GB)

If you are still having some issues, please do not hesitate to reach out.

@paragghosh
paragghosh commented Aug 10, 2021

@chogovadze, thanks for outlining the steps here. I was having the same issues and followed the steps to fix the incompatibility between the TF and CUDA versions. After finishing them, I got an error when I tried to run superpoint (script export_detections.py):
ImportError: No module named 'superpoint'
Following thread #206, I did another round of make install. It finished fine, but I am still getting the same error. Any ideas?

@paragghosh

I realized my error: I was pointing to my earlier venv in the makefile. After I removed that, I reran make install (which reinstalled superpoint). However, I now get the following error when I try to run the export_detections.py script:
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
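
A quick way to narrow this error down is to check whether the dynamic loader can find libcuda.so.1 at all, independently of TensorFlow. As far as I know, this library normally ships with the NVIDIA driver rather than with the cuda/cudnn packages that conda installs, so treat the sketch below as a diagnostic under that assumption. The function name can_load_shared_library is made up for illustration.

```python
import ctypes


def can_load_shared_library(name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the library is missing or not on the loader's search path.
        return False
    return True


if __name__ == "__main__":
    if can_load_shared_library("libcuda.so.1"):
        print("libcuda.so.1 found")
    else:
        print("libcuda.so.1 not found: check the NVIDIA driver installation")
```

If the check fails inside the conda environment but nvidia-smi works on the same machine, the library path is probably not visible to that environment.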

@David-Willo

For those who have difficulty running on GPUs that cannot use older CUDA versions (a 3080 in my case), try switching to NVIDIA's TensorFlow repo: https://github.com/NVIDIA/tensorflow#install
This solved my issue.

@iMeleon
iMeleon commented Oct 27, 2023

Quoting @David-Willo: For those who have difficulty running on GPUs that cannot use older CUDA versions (a 3080 in my case), try switching to NVIDIA's TensorFlow repo https://github.com/NVIDIA/tensorflow#install. This solved my issue.

Thanks. This solved my issue with loss nan, precision nan, recall 0.0000 on an RTX 3090.

@20181313zhang
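
Quoting @David-Willo: For those who have difficulty running on GPUs that cannot use older CUDA versions (a 3080 in my case), try switching to NVIDIA's TensorFlow repo https://github.com/NVIDIA/tensorflow#install. This solved my issue.

Quoting @iMeleon: Thanks. This solved my issue with loss nan, precision nan, recall 0.0000 on an RTX 3090.

Hello, I have an RTX 3080 Ti. Did your training succeed? I would like to get in touch so we can learn from each other. Thanks.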

@vegetable233
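
Quoting @David-Willo: For those who have difficulty running on GPUs that cannot use older CUDA versions (a 3080 in my case), try switching to NVIDIA's TensorFlow repo https://github.com/NVIDIA/tensorflow#install. This solved my issue.

Quoting @iMeleon: Thanks. This solved my issue with loss nan, precision nan, recall 0.0000 on an RTX 3090.

Quoting @20181313zhang: Hello, I have an RTX 3080 Ti. Did your training succeed? I would like to get in touch so we can learn from each other. Thanks.

I also ran into the loss nan problem when training magicpoint. Have you solved it? You can reach me on QQ at 972048746.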
