no CUDA-capable device is detected #3265
Hey @jhpenger, this is because by default we use CPUs only for policy evaluation. Is it necessary to allocate GPUs for the Gibson env to run? That said, you can allocate GPUs for workers too by setting this conf: https://github.com/ray-project/ray/blob/master/python/ray/rllib/agents/ppo/ppo.py#L53. Alternatively, you can set `num_workers: 0`; then the env will live on the driver only and share the GPUs allocated via the `num_gpus` conf.
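A rough sketch of that `num_workers: 0` route, against the agents-based RLlib API from around that Ray version (class names have since changed; `CartPole-v0` is just a placeholder for the Gibson husky env):

```python
import ray
from ray.rllib.agents import ppo  # old-style RLlib API from around this Ray version

ray.init()

config = ppo.DEFAULT_CONFIG.copy()
config["num_workers"] = 0   # rollouts happen on the driver process only
config["num_gpus"] = 1      # GPUs allocated to the driver, which the env then shares

# "CartPole-v0" is a placeholder; the Gibson husky env would be registered and used instead.
agent = ppo.PPOAgent(config=config, env="CartPole-v0")
print(agent.train())
```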
@ericl Gibson needs GPUs to render the environment on creation, and I believe it needs them to run as well. After changing
It sounds like in your case this won't work, since policy evaluation will create a copy of your environment. So you need to allocate GPUs via
This means that your env is returning a scalar observation when it expected a shape of (128, 128, 4). Maybe check your env's `step()`/`reset()` return values, and also that `env.observation_space.contains(obs)` is true for the obs you return?
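A quick sanity check along those lines might look like this (sketch using the old single-return `reset()` Gym API; `CartPole-v0` stands in for the Gibson env):

```python
import gym
import numpy as np

env = gym.make("CartPole-v0")  # placeholder for the Gibson husky env

obs = env.reset()
assert env.observation_space.contains(obs), "reset() returned an obs outside observation_space"

obs, reward, done, info = env.step(env.action_space.sample())
assert env.observation_space.contains(obs), "step() returned an obs outside observation_space"
print("obs shape:", np.asarray(obs).shape)  # should match the space, e.g. (128, 128, 4) for Gibson
```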
Thanks a lot, that helped. @ericl I will work on this more in a few days; I might have more questions.
Btw, could you share more details on how you were able to get that working? Edit: Actually, I think this should be fixed in master, since we now support DictSpace.
@ericl Sorry for the late response. I fixed it by manually changing the Gibson environment's output from
It's great that current Ray supports DictSpace. How recently was this added? It wasn't available in the version of Ray I was running.
@ericl I think I know what the problem was before. I think, in the older ray version,
I'm trying to use xray right now, which has
@ericl Btw, I finally got around to testing whether ray accepts a dictionary of observations; it doesn't. The updated error message is better for sure though: it outputs the entire dictionary that ray is not accepting. I'm using your frac_ppo branch version of ray.
Can you post your script? Try following the examples in this test:
https://github.com/ray-project/ray/blob/master/python/ray/rllib/test/test_nested_spaces.py
Note in particular that you have to implement `_build_layers_v2`.
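For context, a rough sketch of a custom model against the RLlib Model API of that era (the `_build_layers_v2` signature here reflects that old API as best it can be reconstructed; layer sizes are arbitrary, and this API has since been replaced):

```python
import tensorflow as tf
from ray.rllib.models import Model, ModelCatalog


class MyDictModel(Model):
    """Custom model that consumes a Dict observation space."""

    def _build_layers_v2(self, input_dict, num_outputs, options):
        # For nested spaces, input_dict["obs"] mirrors the observation space,
        # so a Dict space arrives here as a dict of tensors.
        obs = input_dict["obs"]
        flat = tf.concat([tf.layers.flatten(v) for v in obs.values()], axis=1)
        hidden = tf.layers.dense(flat, 256, activation=tf.nn.relu)
        output = tf.layers.dense(hidden, num_outputs, activation=None)
        return output, hidden


ModelCatalog.register_custom_model("my_dict_model", MyDictModel)
# Then select it in the agent config: {"model": {"custom_model": "my_dict_model"}}
```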
I don't know if the issue has been completely solved, but since it is marked as open, I will write here. The following command
In case it changes anything, everything is running in a Docker container. The TensorFlow version (
Is nvidia-docker enabled?
@richardliaw I believe so, will double-check a little later. The strange thing is that the TensorFlow code trains fine. When I run
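A quick way to compare what the two frameworks see in that process might be something like this (sketch; `tf.test.is_gpu_available()` is the pre-TF2-style check):

```python
import os

import tensorflow as tf
import torch

print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("torch.cuda.is_available():", torch.cuda.is_available())
print("torch.cuda.device_count():", torch.cuda.device_count())
print("tf.test.is_gpu_available():", tf.test.is_gpu_available())
```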
Makes sense. Want to submit a patch?

On Thu, Jul 4, 2019, Bogdan Mazoure wrote:

> Update: I managed to solve the issue by overriding `TorchPolicyGraph.__init__` and changing `bool(os.environ.get("CUDA_VISIBLE_DEVICES", None))` to `torch.cuda.is_available()`.
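To make the described swap concrete, a minimal illustrative sketch of the two checks (not the actual RLlib source):

```python
import os

import torch

# Old-style check: trusts CUDA_VISIBLE_DEVICES, which Ray sets per actor based on its GPU config.
use_gpu_env_var = bool(os.environ.get("CUDA_VISIBLE_DEVICES", None))

# Suggested check from the comment above: ask PyTorch directly whether CUDA is usable.
use_gpu_torch = torch.cuda.is_available()

device = torch.device("cuda" if use_gpu_torch else "cpu")
print(device)
```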
After investigating further, changing the
Ray will automatically set `CUDA_VISIBLE_DEVICES` inside the actor processes based on the GPU configuration. For example, I just tried running that command with ray==0.7.1 and latest, and I see non-zero GPU utilization; is that different from what you're trying?

Note that if `num_workers > 0`, the GPUs assigned to workers are controlled by `num_gpus_per_worker`. Usually you don't want to assign GPUs to workers, since inference is efficient enough with CPUs, so the GPUs specified by `num_gpus` are only used for the learner. `num_workers == 0` is a special case where both inference and learning are done in the same process.
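As a rough illustration of that split (sketch only; exact config keys can vary across Ray versions, and `CartPole-v0` is just a placeholder env):

```python
from ray import tune

tune.run(
    "PPO",
    config={
        "env": "CartPole-v0",      # placeholder environment
        "num_workers": 2,          # rollout workers; inference runs here, typically on CPU
        "num_gpus": 1,             # GPU(s) reserved for the learner process only
        "num_gpus_per_worker": 0,  # workers get no GPUs unless this is raised
    },
)
```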
Yes, so the issue was that
Same here, and setting CUDA_VISIBLE_DEVICES is not working. If I run the training script without Ray, it works fine.
Just ran into the same problem training a CNN with
Everything was working, then I added:

```python
# validate_save_restore is provided by Ray Tune (in recent versions under ray.tune.utils;
# the exact import path may differ by Ray version)
from ray.tune.utils import validate_save_restore

validate_save_restore(MyAgent, use_object_store=True, config={
    "args": config,
    "lr": 0.01,
    "momentum": 0.9,
    "weight_decay": 0.001,
    "step_size": 31,
    "gamma": 0.001,
})
```

which later causes
Closing this issue because it seems like this is working. Please reopen if not.
@richardliaw I am seeing a similar issue with Ray Serve on a p3.16xlarge EC2 instance. It looks like nccl, nvidia-smi, torch.cuda.device_count(), etc. are working. I am using @simon-mo's script here: https://gist.github.com/simon-mo/b5be0b95d6b79f27780d569073f5588a

I tried #3265 (comment) but it gave me

EDIT: solved by setting
Thank you very much!
System information
Describe the problem
Trying to set up an RLlib PPO agent with `husky_env` from Gibson Env. The script I ran can be found here.

I am getting the following error when calling `agent.train()`:

Gibson does the environment rendering upon environment creation, and the `rllib` agent seems to invoke `env_creator` every time `train()` is called. I originally thought that was the issue, but I don't think it is the case.

I tried using `gpu_fraction`; it didn't work. Not sure what is causing the problem.

`nvidia-smi`
`torch.cuda.device_count()`
`nvcc --version`
To Reproduce

1. Get Nvidia-Docker2: https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)
2. Download Gibson's dataset.
3. Pull Gibson's image.
4. Run it in Docker, replacing `<dataset-absolute-path>` with the absolute path to the Gibson dataset you've unzipped on your local machine.
5. Add in the ray_husky.py script: copy the `ray_husky.py` found here to the `~/mount/gibson/examples/train/` directory in the docker container.
6. Run: `python ray_husky.py`
Full Log