
[Bug] RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_addmm) #21921

Closed
avacaondata opened this issue Jan 27, 2022 · 10 comments
Labels
bug: Something that is supposed to be working; but isn't
multiple-reports: multiple reports of this issue
P2: Important issue, but not time-critical
rllib: RLlib related issues
rllib-connector: Connector related issues

Comments


avacaondata commented Jan 27, 2022

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Tune, RLlib

What happened + What you expected to happen

When trying an RLlib experiment following these guidelines: https://www.tensortrade.org/en/latest/examples/train_and_evaluate_using_ray.html
with this config:
tune.run(
    run_or_experiment="PPO",  # We'll be using the builtin PPO agent in RLLib
    name="MyExperiment1",
    metric='episode_reward_mean',
    mode='max',
    # resources_per_trial={"cpu": 8, "gpu": 1},
    stop={
        "training_iteration": 100  # Let's do 5 steps for each hyperparameter combination
    },
    config={
        "env": "MyTrainingEnv",
        "env_config": config_train,  # The dictionary we built before
        "log_level": "WARNING",
        "framework": "torch",
        "_fake_gpus": False,
        "ignore_worker_failures": True,
        "num_workers": 1,  # One worker per agent. You can increase this but it will run fewer parallel trainings.
        "num_envs_per_worker": 1,
        "num_gpus": 1,  # I yet have to understand if using a GPU is worth it, for our purposes, but I think it's not. This way you can train on a non-gpu enabled system.
        "clip_rewards": True,
        "lr": LEARNING_RATE,  # Hyperparameter grid search defined above
        "gamma": GAMMA,  # This can have a big impact on the result and needs to be properly tuned (range is 0 to 1)
        "lambda": LAMBDA,
        "observation_filter": "MeanStdFilter",
        "model": {
            "fcnet_hiddens": FC_SIZE,  # Hyperparameter grid search defined above
            # "use_attention": True,
            # "attention_use_n_prev_actions": 120,
            # "attention_use_n_prev_rewards": 120
        },
        "sgd_minibatch_size": MINIBATCH_SIZE,  # Hyperparameter grid search defined above
        "evaluation_interval": 1,  # Run evaluation on every iteration
        "evaluation_config": {
            "env_config": config_eval,  # The dictionary we built before (only the overriding keys to use in evaluation)
            "explore": False,  # We don't want to explore during evaluation. All actions have to be repeatable.
        },
    },
    num_samples=1,  # Have one sample for each hyperparameter combination. You can have more to average out randomness.
    keep_checkpoints_num=3,  # Keep the last 2 checkpoints
    checkpoint_freq=1,  # Do a checkpoint on each iteration (slower but you can pick more finely the checkpoint to use later)
    local_dir=r"D:\ray_results"
)
I encountered the following error:

Traceback (most recent call last):
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\tune\trial_runner.py", line 886, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\tune\ray_trial_executor.py", line 675, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\_private\client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\worker.py", line 1760, in get
    raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::PPOTrainer.__init__() (pid=12840, ip=127.0.0.1, repr=PPOTrainer)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\util\tracing\tracing_helper.py", line 451, in _resume_span
    return method(self, *_args, **_kwargs)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\agents\trainer.py", line 948, in _init
    raise NotImplementedError
NotImplementedError

During handling of the above exception, another exception occurred:

ray::PPOTrainer.__init__() (pid=12840, ip=127.0.0.1, repr=PPOTrainer)
  File "python\ray\_raylet.pyx", line 633, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 674, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 640, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 644, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 593, in ray._raylet.execute_task.function_executor
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\_private\function_manager.py", line 648, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\util\tracing\tracing_helper.py", line 451, in _resume_span
    return method(self, *_args, **_kwargs)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\agents\trainer.py", line 741, in __init__
    super().__init__(config, logger_creator, remote_checkpoint_dir,
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\tune\trainable.py", line 124, in __init__
    self.setup(copy.deepcopy(self.config))
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\util\tracing\tracing_helper.py", line 451, in _resume_span
    return method(self, *_args, **_kwargs)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\agents\trainer.py", line 846, in setup
    self.workers = self._make_workers(
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\util\tracing\tracing_helper.py", line 451, in _resume_span
    return method(self, *_args, **_kwargs)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\agents\trainer.py", line 1971, in _make_workers
    return WorkerSet(
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\evaluation\worker_set.py", line 123, in __init__
    self._local_worker = self._make_worker(
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\evaluation\worker_set.py", line 499, in _make_worker
    worker = cls(
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 586, in __init__
    self._build_policy_map(
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 1569, in _build_policy_map
    self.policy_map.create_policy(name, orig_cls, obs_space, act_space,
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\policy\policy_map.py", line 143, in create_policy
    self[policy_id] = class_(observation_space, action_space,
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\agents\ppo\ppo_torch_policy.py", line 50, in __init__
    self._initialize_loss_from_dummy_batch()
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\policy\policy.py", line 832, in _initialize_loss_from_dummy_batch
    self.compute_actions_from_input_dict(
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\policy\torch_policy.py", line 294, in compute_actions_from_input_dict
    return self._compute_action_helper(input_dict, state_batches,
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\utils\threading.py", line 21, in wrapper
    return func(self, *a, **k)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\policy\torch_policy.py", line 934, in _compute_action_helper
    dist_inputs, state_out = self.model(input_dict, state_batches,
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\models\modelv2.py", line 243, in __call__
    res = self.forward(restored, state or [], seq_lens)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\models\torch\complex_input_net.py", line 193, in forward
    nn_out, _ = self.flatten[i](SampleBatch({
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\models\modelv2.py", line 243, in __call__
    res = self.forward(restored, state or [], seq_lens)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\models\torch\fcnet.py", line 124, in forward
    self._features = self._hidden_layers(self._last_flat_in)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\torch\nn\modules\container.py", line 141, in forward
    input = module(input)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\models\torch\misc.py", line 160, in forward
    return self._model(x)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\torch\nn\modules\container.py", line 141, in forward
    input = module(input)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\torch\nn\modules\linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
  File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\torch\nn\functional.py", line 1849, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_addmm)

I would expect tensors to be placed on the same device.

Versions / Dependencies

OS: Windows 10
Ray: 2.0.0.dev0
Python: 3.8
Torch: 1.10.1
CUDA: 11.4

Reproduction script

https://www.tensortrade.org/en/latest/examples/train_and_evaluate_using_ray.html

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
avacaondata added the bug and triage labels on Jan 27, 2022
@easysoft2k15

@alexvaca0,
I'm running into the same problem. Did you manage to solve it somehow?
Thank you

@sadimoodi

Same problem here; any solution yet?

@easysoft2k15

I'm not sure yet, but I think the reason I get this error is that my (custom) environment uses a multi-dimensional observation space:

self.observation_space = Box(-1.0, 1.0, (8, 7))

After moving to stable-baselines3, I discovered that most of the algorithms used in RL support only flattened spaces (the library has a check_env utility function).

I modified my environment accordingly, and in stable-baselines3 it now works just fine.

I suspect that if I test it on Ray it will work just as well, but I have not tested it yet.

I hope this helps.
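
For concreteness, a minimal sketch of that change, assuming a gym-style environment (the class name, action space, and zero-filled state below are purely illustrative):

import gym
import numpy as np
from gym.spaces import Box, Discrete


class MyFlatEnv(gym.Env):
    # Illustrative environment whose observation space has been flattened
    # from (8, 7) to (56,), which is the change described above.
    def __init__(self, config=None):
        super().__init__()
        # Before: Box(-1.0, 1.0, (8, 7))  -> multi-dimensional, triggers the device mismatch
        # After: a single flat dimension
        self.observation_space = Box(-1.0, 1.0, (8 * 7,), dtype=np.float32)
        self.action_space = Discrete(2)
        self._state = np.zeros((8, 7), dtype=np.float32)

    def reset(self):
        self._state = np.zeros((8, 7), dtype=np.float32)
        return self._state.reshape(-1)  # flatten before returning

    def step(self, action):
        obs = self._state.reshape(-1)
        return obs, 0.0, True, {}

The check_env utility mentioned above (from stable_baselines3.common.env_checker import check_env; check_env(MyFlatEnv())) can then be run against an instance to confirm the flattened space is accepted.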

@sadimoodi


@easysoft2k15 you are partially right. In my experiments, Ray does not work with multi-dimensional observation spaces unless you use "conv_filters", as per the documentation here:
(screenshot of the RLlib documentation on conv_filters omitted)
The bug we see here is due to Torch moving tensors from GPU to CPU, which causes the issue when you train on both CPU and GPU, so when I disabled the GPU and trained only on the CPU, everything went well.
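
For reference, a minimal sketch of the two workarounds described above, assuming the config dictionary from the original report has been pulled out into a variable named config (that variable name and the conv_filters values are illustrative, not taken from the documentation screenshot):

# Workaround 1: disable the GPU so every tensor lives on the same device.
config["num_gpus"] = 0

# Workaround 2: for image-like multi-dimensional observations, supply an explicit
# conv stack via "conv_filters" instead of relying on the defaults.
# Each entry is [out_channels, kernel, stride]; these particular values are illustrative.
config["model"] = {
    "conv_filters": [
        [16, [4, 4], 2],
        [32, [4, 4], 2],
    ],
}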


evo11x commented Feb 18, 2022

Thanks! Flattening the observation fixed the problem.

smorad (Contributor) commented Mar 4, 2022

Yeah, it seems if you have any observations like gym.spaces.Box(shape=(2, 1)), you will get this error. Making this gym.spaces.Box(shape=(2,)) fixes the problem. IMO this is a very confusing bug for a common use case. Why does the observation space mess with the underlying torch.device? @sven1977 maybe we should assert that all spaces are flattened or something?

@bhavithran1

This problem was solved for me by pip install ray[default,tune,rllib,serve]==1.9.2

Hope it helps!

krfricke added the rllib label on Apr 4, 2022
gjoliver added the P2, rllib-connector, and multiple-reports labels and removed the triage label on Apr 9, 2022

michaelfeil commented May 9, 2022


Pinning to ray==1.9.2 as suggested above works. It seems the built-in ModelV2 had some problems with non-flat observations. In my case I got the same error because I forgot to define custom_model in the trainer config (so it fell back to the built-in model). There are a couple of solutions: define a custom model that flattens the input space or that can handle your multi-dimensional observations, or write a gym.Wrapper that flattens the observations.
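
A minimal sketch of the wrapper route (the wrapper class name and the commented-out registration call are illustrative; MyTrainingEnv is the env name from the original config):

import gym
import numpy as np
from gym import spaces


class FlattenObsWrapper(gym.ObservationWrapper):
    # Flattens a multi-dimensional Box observation, e.g. (8, 7) -> (56,).
    def __init__(self, env):
        super().__init__(env)
        box = env.observation_space
        self.observation_space = spaces.Box(
            low=box.low.reshape(-1),
            high=box.high.reshape(-1),
            dtype=np.float32,
        )

    def observation(self, obs):
        return np.asarray(obs, dtype=np.float32).reshape(-1)


# Usage (illustrative): re-register the tutorial env so RLlib only ever sees flat observations.
# from ray.tune.registry import register_env
# register_env("MyTrainingEnv", lambda cfg: FlattenObsWrapper(MyTrainingEnv(cfg)))

gym.wrappers.FlattenObservation does essentially the same thing for plain Box spaces, if you prefer not to write the wrapper yourself.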


dan-1d commented Aug 17, 2022

This is a real bug in ray/rllib/models/torch/complex_input_net.py, and has been fixed in the master branch. I independently made the same changes as this commit, and they fixed my problem.

a598458

The problem in ComplexInputNetwork was that the Torch sub-modules for the "one-hot" and "flatten" observation components were not all being registered, so their parameters were never moved to the GPU.
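
To illustrate the failure mode in isolation (a simplified standalone example, not the actual RLlib code): sub-modules stored in a plain Python dict are invisible to nn.Module, so .to("cuda") never moves their parameters, and the forward pass then mixes CPU weights with GPU inputs, reproducing the same class of device-mismatch error as above.

import torch
import torch.nn as nn


class BrokenNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Stored in a plain dict: NOT registered as a sub-module,
        # so .to(device) will not move its parameters off the CPU.
        self.flatten = {"obs": nn.Linear(4, 8)}
        # Registered normally: moved to the GPU as expected.
        self.head = nn.Linear(8, 2)

    def forward(self, x):
        return self.head(self.flatten["obs"](x))


if torch.cuda.is_available():
    net = BrokenNet().to("cuda")
    x = torch.randn(1, 4, device="cuda")
    net(x)  # RuntimeError: Expected all tensors to be on the same device ...

Registering the layers properly (for example via nn.ModuleDict or self.add_module) lets .to("cuda") move them along with the rest of the model, which is the kind of registration fix described above.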


timf34 commented Aug 27, 2022

The bug we see here is due to Torch moving tensors from GPU to CPU, which causes the issue when you train on both CPU and GPU, so when I disabled the GPU and trained only on the CPU, everything went well.

Not using the GPU won't work for me unfortunately (I need it for speed)... is there any fix for this that still lets me use the GPU?
Do I need to flatten the observations? If so, how do I do that? Do I flatten them before they're fed to the network, and how does that work when I'm using CNNs?
