
[Bug] RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_addmm) #21921

Closed
avacaondata opened this issue Jan 27, 2022 · 10 comments
Labels
bug: Something that is supposed to be working; but isn't
multiple-reports: multiple reports of this issue
P2: Important issue, but not time-critical
rllib: RLlib related issues
rllib-connector: Connector related issues

Comments


avacaondata commented Jan 27, 2022

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Tune, RLlib

What happened + What you expected to happen

When trying an RLlib experiment following these guidelines: https://www.tensortrade.org/en/latest/examples/train_and_evaluate_using_ray.html
with this config:
tune.run(
    run_or_experiment="PPO",  # We'll be using the builtin PPO agent in RLLib
    name="MyExperiment1",
    metric='episode_reward_mean',
    mode='max',
    # resources_per_trial={"cpu": 8, "gpu": 1},
    stop={
        "training_iteration": 100  # Let's do 5 steps for each hyperparameter combination
    },
    config={
        "env": "MyTrainingEnv",
        "env_config": config_train,  # The dictionary we built before
        "log_level": "WARNING",
        "framework": "torch",
        "_fake_gpus": False,
        "ignore_worker_failures": True,
        "num_workers": 1,  # One worker per agent. You can increase this but it will run fewer parallel trainings.
        "num_envs_per_worker": 1,
        "num_gpus": 1,  # I yet have to understand if using a GPU is worth it, for our purposes, but I think it's not. This way you can train on a non-gpu enabled system.
        "clip_rewards": True,
        "lr": LEARNING_RATE,  # Hyperparameter grid search defined above
        "gamma": GAMMA,  # This can have a big impact on the result and needs to be properly tuned (range is 0 to 1)
        "lambda": LAMBDA,
        "observation_filter": "MeanStdFilter",
        "model": {
            "fcnet_hiddens": FC_SIZE,  # Hyperparameter grid search defined above
            # "use_attention": True,
            # "attention_use_n_prev_actions": 120,
            # "attention_use_n_prev_rewards": 120
        },
        "sgd_minibatch_size": MINIBATCH_SIZE,  # Hyperparameter grid search defined above
        "evaluation_interval": 1,  # Run evaluation on every iteration
        "evaluation_config": {
            "env_config": config_eval,  # The dictionary we built before (only the overriding keys to use in evaluation)
            "explore": False,  # We don't want to explore during evaluation. All actions have to be repeatable.
        },
    },
    num_samples=1,  # Have one sample for each hyperparameter combination. You can have more to average out randomness.
    keep_checkpoints_num=3,  # Keep the last 2 checkpoints
    checkpoint_freq=1,  # Do a checkpoint on each iteration (slower but you can pick more finely the checkpoint to use later)
    local_dir=r"D:\ray_results"
)
I encountered the following error:

Traceback (most recent call last):
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\tune\trial_runner.py", line 886, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\tune\ray_trial_executor.py", line 675, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\_private\client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\worker.py", line 1760, in get
    raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::PPOTrainer.__init__() (pid=12840, ip=127.0.0.1, repr=PPOTrainer)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\util\tracing\tracing_helper.py", line 451, in _resume_span
    return method(self, *_args, **_kwargs)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\agents\trainer.py", line 948, in _init
    raise NotImplementedError
NotImplementedError

During handling of the above exception, another exception occurred:

ray::PPOTrainer.__init__() (pid=12840, ip=127.0.0.1, repr=PPOTrainer)
  File "python\ray\_raylet.pyx", line 633, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 674, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 640, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 644, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 593, in ray._raylet.execute_task.function_executor
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\_private\function_manager.py", line 648, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\util\tracing\tracing_helper.py", line 451, in _resume_span
    return method(self, *_args, **_kwargs)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\agents\trainer.py", line 741, in __init__
    super().__init__(config, logger_creator, remote_checkpoint_dir,
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\tune\trainable.py", line 124, in __init__
    self.setup(copy.deepcopy(self.config))
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\util\tracing\tracing_helper.py", line 451, in _resume_span
    return method(self, *_args, **_kwargs)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\agents\trainer.py", line 846, in setup
    self.workers = self._make_workers(
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\util\tracing\tracing_helper.py", line 451, in _resume_span
    return method(self, *_args, **_kwargs)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\agents\trainer.py", line 1971, in _make_workers
    return WorkerSet(
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\evaluation\worker_set.py", line 123, in __init__
    self._local_worker = self._make_worker(
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\evaluation\worker_set.py", line 499, in _make_worker
    worker = cls(
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 586, in __init__
    self._build_policy_map(
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 1569, in _build_policy_map
    self.policy_map.create_policy(name, orig_cls, obs_space, act_space,
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\policy\policy_map.py", line 143, in create_policy
    self[policy_id] = class_(observation_space, action_space,
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\agents\ppo\ppo_torch_policy.py", line 50, in __init__
    self._initialize_loss_from_dummy_batch()
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\policy\policy.py", line 832, in _initialize_loss_from_dummy_batch
    self.compute_actions_from_input_dict(
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\policy\torch_policy.py", line 294, in compute_actions_from_input_dict
    return self._compute_action_helper(input_dict, state_batches,
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\utils\threading.py", line 21, in wrapper
    return func(self, *a, **k)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\policy\torch_policy.py", line 934, in _compute_action_helper
    dist_inputs, state_out = self.model(input_dict, state_batches,
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\models\modelv2.py", line 243, in __call__
    res = self.forward(restored, state or [], seq_lens)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\models\torch\complex_input_net.py", line 193, in forward
    nn_out, _ = self.flatten[i](SampleBatch({
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\models\modelv2.py", line 243, in __call__
    res = self.forward(restored, state or [], seq_lens)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\models\torch\fcnet.py", line 124, in forward
    self._features = self._hidden_layers(self._last_flat_in)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\torch\nn\modules\container.py", line 141, in forward
    input = module(input)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\models\torch\misc.py", line 160, in forward
    return self._model(x)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\torch\nn\modules\container.py", line 141, in forward
    input = module(input)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\torch\nn\modules\linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\torch\nn\functional.py", line 1849, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_addmm)

I would expect tensors to be placed on the same device.

Versions / Dependencies

OS: Windows 10
Ray: 2.0.0.dev0
Python: 3.8
Torch: 1.10.1
CUDA: 11.4

Reproduction script

https://www.tensortrade.org/en/latest/examples/train_and_evaluate_using_ray.html

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
avacaondata added the bug and triage labels on Jan 27, 2022
@easysoft2k15

@alexvaca0,
I'm running into the same problem. Did you manage to solve it somehow?
Thank you

@sadimoodi

Same problem here; any solution yet?

@easysoft2k15

I'm not sure yet, but I think the reason I get this error is that my (custom) environment uses a multi-dimensional observation space:

self.observation_space = Box(-1.0, 1.0, (8, 7))

After moving to stable-baselines3, I discovered that most of the algorithms used in RL support only flattened spaces (the library has a check_env utility function).

I modified my environment accordingly, and in stable-baselines3 it now works just fine.

I suspect that if I test it on Ray it will work just as well, but I have not tested it yet.

I hope this helps.
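
For concreteness, a minimal sketch of that change, assuming a gym-style environment (the class name, action space, and zero-filled state below are purely illustrative):

import gym
import numpy as np
from gym.spaces import Box, Discrete


class MyFlatEnv(gym.Env):
    # Illustrative environment whose observation space has been flattened
    # from (8, 7) to (56,), which is the change described above.
    def __init__(self, config=None):
        super().__init__()
        # Before: Box(-1.0, 1.0, (8, 7))  -> multi-dimensional, triggers the device mismatch
        # After: a single flat dimension
        self.observation_space = Box(-1.0, 1.0, (8 * 7,), dtype=np.float32)
        self.action_space = Discrete(2)
        self._state = np.zeros((8, 7), dtype=np.float32)

    def reset(self):
        self._state = np.zeros((8, 7), dtype=np.float32)
        return self._state.reshape(-1)  # flatten before returning

    def step(self, action):
        obs = self._state.reshape(-1)
        return obs, 0.0, True, {}

The check_env utility mentioned above (from stable_baselines3.common.env_checker import check_env; check_env(MyFlatEnv())) can then be run against an instance to confirm the flattened space is accepted.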

@sadimoodi


@easysoft2k15 you are partially right. In my experiments, Ray does not work with multi-dimensional observation spaces unless you use "conv_filters", as per the documentation here:
(screenshot of the RLlib documentation on conv_filters omitted)
The bug we see here is due to Torch moving tensors from GPU to CPU, which causes the issue when you train on both CPU and GPU, so when I disabled the GPU and trained only on the CPU, everything went well.
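
For reference, a minimal sketch of the two workarounds described above, assuming the config dictionary from the original report has been pulled out into a variable named config (that variable name and the conv_filters values are illustrative, not taken from the documentation screenshot):

# Workaround 1: disable the GPU so every tensor lives on the same device.
config["num_gpus"] = 0

# Workaround 2: for image-like multi-dimensional observations, supply an explicit
# conv stack via "conv_filters" instead of relying on the defaults.
# Each entry is [out_channels, kernel, stride]; these particular values are illustrative.
config["model"] = {
    "conv_filters": [
        [16, [4, 4], 2],
        [32, [4, 4], 2],
    ],
}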


evo11x commented Feb 18, 2022

Thanks! Flattening the observation fixed the problem.

smorad (Contributor) commented Mar 4, 2022

Yeah, it seems if you have any observations like gym.spaces.Box(shape=(2, 1)), you will get this error. Making this gym.spaces.Box(shape=(2,)) fixes the problem. IMO this is a very confusing bug for a common use case. Why does the observation space mess with the underlying torch.device? @sven1977 maybe we should assert that all spaces are flattened or something?

@bhavithran1

This problem was solved for me by pip install ray[default,tune,rllib,serve]==1.9.2

Hope it helps!

krfricke added the rllib label on Apr 4, 2022
gjoliver added the P2, rllib-connector, and multiple-reports labels and removed the triage label on Apr 9, 2022

michaelfeil commented May 9, 2022


Pinning to ray==1.9.2 as suggested above works. It seems the built-in ModelV2 had some problems with non-flat observations. In my case I got the same error because I forgot to define custom_model in the trainer config (so it fell back to the built-in model). There are a couple of solutions: define a custom model that flattens the input space or that can handle your multi-dimensional observations, or write a gym.Wrapper that flattens the observations.
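
A minimal sketch of the wrapper route (the wrapper class name and the commented-out registration call are illustrative; MyTrainingEnv is the env name from the original config):

import gym
import numpy as np
from gym import spaces


class FlattenObsWrapper(gym.ObservationWrapper):
    # Flattens a multi-dimensional Box observation, e.g. (8, 7) -> (56,).
    def __init__(self, env):
        super().__init__(env)
        box = env.observation_space
        self.observation_space = spaces.Box(
            low=box.low.reshape(-1),
            high=box.high.reshape(-1),
            dtype=np.float32,
        )

    def observation(self, obs):
        return np.asarray(obs, dtype=np.float32).reshape(-1)


# Usage (illustrative): re-register the tutorial env so RLlib only ever sees flat observations.
# from ray.tune.registry import register_env
# register_env("MyTrainingEnv", lambda cfg: FlattenObsWrapper(MyTrainingEnv(cfg)))

gym.wrappers.FlattenObservation does essentially the same thing for plain Box spaces, if you prefer not to write the wrapper yourself.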


dan-1d commented Aug 17, 2022

This is a real bug in ray/rllib/models/torch/complex_input_net.py, and has been fixed in the master branch. I independently made the same changes as this commit, and they fixed my problem.

a598458

The problem in ComplexInputNetwork was that the Torch sub-modules for the "one-hot" and "flatten" observation components were not all being registered, so their parameters were never moved to the GPU.
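
To illustrate the failure mode in isolation (a simplified standalone example, not the actual RLlib code): sub-modules stored in a plain Python dict are invisible to nn.Module, so .to("cuda") never moves their parameters, and the forward pass then mixes CPU weights with GPU inputs, reproducing the same class of device-mismatch error as above.

import torch
import torch.nn as nn


class BrokenNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Stored in a plain dict: NOT registered as a sub-module,
        # so .to(device) will not move its parameters off the CPU.
        self.flatten = {"obs": nn.Linear(4, 8)}
        # Registered normally: moved to the GPU as expected.
        self.head = nn.Linear(8, 2)

    def forward(self, x):
        return self.head(self.flatten["obs"](x))


if torch.cuda.is_available():
    net = BrokenNet().to("cuda")
    x = torch.randn(1, 4, device="cuda")
    net(x)  # RuntimeError: Expected all tensors to be on the same device ...

Registering the layers properly (for example via nn.ModuleDict or self.add_module) lets .to("cuda") move them along with the rest of the model, which is the kind of registration fix described above.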


timf34 commented Aug 27, 2022

The bug we see here is due to Torch moving tensors from GPU to CPU, which causes the issue when you train on both CPU and GPU, so when I disabled the GPU and trained only on the CPU, everything went well.

Not using the GPU won't work for me unfortunately (I need it for speed)... is there any fix for this that still lets me use the GPU?
Do I need to flatten the observations? If so, how do I do that? Do I flatten them before they're fed to the network, and how does that work when I'm using CNNs?
