no CUDA-capable device is detected #3265
Hey @jhpenger, this is because by default we use CPUs only for policy evaluation. Is it necessary to allocate GPUs for the Gibson env to run? That said, you can allocate GPUs for workers too by setting this conf: https://github.com/ray-project/ray/blob/master/python/ray/rllib/agents/ppo/ppo.py#L53. Alternatively, you can set `num_workers: 0`; then the env will live on the driver only and share the GPUs allocated via the `num_gpus` conf.
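A rough sketch of that `num_workers: 0` route, against the agents-based RLlib API from around that Ray version (class names have since changed; `CartPole-v0` is just a placeholder for the Gibson husky env):

```python
import ray
from ray.rllib.agents import ppo  # old-style RLlib API from around this Ray version

ray.init()

config = ppo.DEFAULT_CONFIG.copy()
config["num_workers"] = 0   # rollouts happen on the driver process only
config["num_gpus"] = 1      # GPUs allocated to the driver, which the env then shares

# "CartPole-v0" is a placeholder; the Gibson husky env would be registered and used instead.
agent = ppo.PPOAgent(config=config, env="CartPole-v0")
print(agent.train())
```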
@ericl Gibson needs GPUs to render the environment on creation, and I believe it needs them to run as well. After changing
It sounds like in your case this won't work, since policy evaluation will create a copy of your environment. So you need to allocate GPUs via
This means that your env is returning a scalar observation when it expected a shape of (128, 128, 4). Maybe check your env's `step()`/`reset()` return values, and also that `env.observation_space.contains(obs)` is true for the obs you return?
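A quick sanity check along those lines might look like this (sketch using the old single-return `reset()` Gym API; `CartPole-v0` stands in for the Gibson env):

```python
import gym
import numpy as np

env = gym.make("CartPole-v0")  # placeholder for the Gibson husky env

obs = env.reset()
assert env.observation_space.contains(obs), "reset() returned an obs outside observation_space"

obs, reward, done, info = env.step(env.action_space.sample())
assert env.observation_space.contains(obs), "step() returned an obs outside observation_space"
print("obs shape:", np.asarray(obs).shape)  # should match the space, e.g. (128, 128, 4) for Gibson
```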
Thanks a lot, that helped. @ericl I will work on this more in a few days; I might have more questions.
Btw, could you share more details on how you were able to get that working? Edit: Actually, I think this should be fixed in master, since we now support DictSpace.
@ericl Sorry for the late response. I fixed it by manually changing the Gibson environment's output from
It's great that current Ray supports DictSpace. How recently was this added? It wasn't available in the version of Ray I was running.
@ericl I think I know what the problem was before. I think, in the older ray version,
I'm trying to use xray right now, which has
@ericl Btw, I finally got around to testing whether ray accepts a dictionary of observations; it doesn't. The updated error message is better for sure though: it outputs the entire dictionary that ray is not accepting. I'm using your frac_ppo branch version of ray.
Can you post your script? Try following the examples in this test:
https://github.com/ray-project/ray/blob/master/python/ray/rllib/test/test_nested_spaces.py
Note in particular that you have to implement `_build_layers_v2`.
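For context, a rough sketch of a custom model against the RLlib Model API of that era (the `_build_layers_v2` signature here reflects that old API as best it can be reconstructed; layer sizes are arbitrary, and this API has since been replaced):

```python
import tensorflow as tf
from ray.rllib.models import Model, ModelCatalog


class MyDictModel(Model):
    """Custom model that consumes a Dict observation space."""

    def _build_layers_v2(self, input_dict, num_outputs, options):
        # For nested spaces, input_dict["obs"] mirrors the observation space,
        # so a Dict space arrives here as a dict of tensors.
        obs = input_dict["obs"]
        flat = tf.concat([tf.layers.flatten(v) for v in obs.values()], axis=1)
        hidden = tf.layers.dense(flat, 256, activation=tf.nn.relu)
        output = tf.layers.dense(hidden, num_outputs, activation=None)
        return output, hidden


ModelCatalog.register_custom_model("my_dict_model", MyDictModel)
# Then select it in the agent config: {"model": {"custom_model": "my_dict_model"}}
```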
I don't know if the issue has been completely solved, but since it is marked as open, I will write here. The following command
In case it changes anything, everything is running in a Docker container. The TensorFlow version (
Is nvidia-docker enabled?
@richardliaw I believe so, will double-check a little later. The strange thing is that the TensorFlow code trains fine. When I run
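A quick way to compare what the two frameworks see in that process might be something like this (sketch; `tf.test.is_gpu_available()` is the pre-TF2-style check):

```python
import os

import tensorflow as tf
import torch

print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("torch.cuda.is_available():", torch.cuda.is_available())
print("torch.cuda.device_count():", torch.cuda.device_count())
print("tf.test.is_gpu_available():", tf.test.is_gpu_available())
```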
Makes sense. Want to submit a patch?

On Thu, Jul 4, 2019, Bogdan Mazoure wrote:

> Update: I managed to solve the issue by overriding `TorchPolicyGraph.__init__` and changing `bool(os.environ.get("CUDA_VISIBLE_DEVICES", None))` to `torch.cuda.is_available()`.
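To make the described swap concrete, a minimal illustrative sketch of the two checks (not the actual RLlib source):

```python
import os

import torch

# Old-style check: trusts CUDA_VISIBLE_DEVICES, which Ray sets per actor based on its GPU config.
use_gpu_env_var = bool(os.environ.get("CUDA_VISIBLE_DEVICES", None))

# Suggested check from the comment above: ask PyTorch directly whether CUDA is usable.
use_gpu_torch = torch.cuda.is_available()

device = torch.device("cuda" if use_gpu_torch else "cpu")
print(device)
```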
After investigating further, changing the
Ray will automatically set `CUDA_VISIBLE_DEVICES` inside the actor processes based on the GPU configuration. For example, I just tried running that command with ray==0.7.1 and latest, and I see non-zero GPU utilization; is that different from what you're trying?

Note that if `num_workers > 0`, the GPUs assigned to workers are controlled by `num_gpus_per_worker`. Usually you don't want to assign GPUs to workers, since inference is efficient enough with CPUs, so the GPUs specified by `num_gpus` are only used for the learner. `num_workers == 0` is a special case where both inference and learning are done in the same process.
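As a rough illustration of that split (sketch only; exact config keys can vary across Ray versions, and `CartPole-v0` is just a placeholder env):

```python
from ray import tune

tune.run(
    "PPO",
    config={
        "env": "CartPole-v0",      # placeholder environment
        "num_workers": 2,          # rollout workers; inference runs here, typically on CPU
        "num_gpus": 1,             # GPU(s) reserved for the learner process only
        "num_gpus_per_worker": 0,  # workers get no GPUs unless this is raised
    },
)
```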
Yes, so the issue was that
Same here, and setting CUDA_VISIBLE_DEVICES is not working. If I run the training script without Ray, it works fine.
Just ran into the same problem training a CNN with
Everything was working, then I added:

```python
# validate_save_restore is provided by Ray Tune (in recent versions under ray.tune.utils;
# the exact import path may differ by Ray version)
from ray.tune.utils import validate_save_restore

validate_save_restore(MyAgent, use_object_store=True, config={
    "args": config,
    "lr": 0.01,
    "momentum": 0.9,
    "weight_decay": 0.001,
    "step_size": 31,
    "gamma": 0.001,
})
```

which later causes
Closing this issue because it seems like this is working. Please reopen if not.
@richardliaw I am seeing a similar issue with Ray Serve on a p3.16xlarge EC2 instance. It looks like nccl, nvidia-smi, torch.cuda.device_count(), etc. are working. I am using @simon-mo's script here: https://gist.github.com/simon-mo/b5be0b95d6b79f27780d569073f5588a

I tried #3265 (comment) but it gave me

EDIT: solved by setting
Thank you very much!
System information
Describe the problem
Trying to set up an RLlib PPO agent with `husky_env` from Gibson Env. The script I ran can be found here.

I am getting the following error when calling `agent.train()`:

Gibson does the environment rendering upon environment creation, and the `rllib` agent seems to invoke `env_creator` every time `train()` is called. I originally thought that was the issue, but I don't think it is the case.

I tried using `gpu_fraction`; it didn't work. Not sure what is causing the problem.

`nvidia-smi`
`torch.cuda.device_count()`
`nvcc --version`
To Reproduce

1. Get Nvidia-Docker2: https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)
2. Download Gibson's dataset.
3. Pull Gibson's image.
4. Run it in Docker, replacing `<dataset-absolute-path>` with the absolute path to the Gibson dataset you've unzipped on your local machine.
5. Add in the ray_husky.py script: copy the `ray_husky.py` found here to the `~/mount/gibson/examples/train/` directory in the docker container.
6. Run: `python ray_husky.py`
Full Log