ARM (and YARR) conflicts with current RLBench (1.2.0) #8

Open
alexanderdurr opened this issue May 27, 2022 · 9 comments


alexanderdurr commented May 27, 2022

Hi,
could you tell me which RLBench and YARR versions/tags are compatible with each other?
I believe PyTorch is behind most of the problems, and none of the requirements.txt files say which version you use to make things work.

I observe this error

Process train_env0:
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/site-packages/yarr/runners/_env_runner.py", line 169, in _run_env
    raise e
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/site-packages/yarr/runners/_env_runner.py", line 143, in _run_env
    for replay_transition in generator:
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/site-packages/yarr/utils/rollout_generator.py", line 35, in generator
    transition = env.step(act_result)
  File "/home/user/ARM/arm/custom_rlbench_env.py", line 128, in step
    obs, reward, terminal = self._task.step(action)
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/site-packages/rlbench/task_environment.py", line 99, in step
    self._action_mode.action(self._scene, action)
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/site-packages/rlbench/action_modes/action_mode.py", line 32, in action
    arm_action = np.array(action[:arm_act_size])
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/site-packages/torch/_tensor.py", line 732, in __array__
    return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
[2022-05-27 10:10:31,983][root][ERROR] - Env train_env0 failed too many times (11 times > 10)
Exception in thread EnvRunnerThread:
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/site-packages/yarr/runners/env_runner.py", line 134, in _run
    raise RuntimeError('Too many process failures.')
RuntimeError: Too many process failures.

I pulled the current versions of RLBench and YARR and reinstalled all packages in a new conda environment.

I am wondering whether you use a different torch version that handles the tensor-to-numpy conversion automatically.
For now I have fixed this by adding .cpu() in a few files:

YARR

 git diff main

diff --git a/yarr/envs/rlbench_env.py b/yarr/envs/rlbench_env.py
index 6aad118..6460fb1 100644
--- a/yarr/envs/rlbench_env.py
+++ b/yarr/envs/rlbench_env.py
@@ -6,7 +6,7 @@ try:
 except (ModuleNotFoundError, ImportError) as e:
     print("You need to install RLBench: 'https://github.com/stepjam/RLBench'")
     raise e
-from rlbench.action_modes import ActionMode
+from rlbench.action_modes.action_mode import ActionMode
 from rlbench.backend.observation import Observation
 from rlbench.backend.task import Task
 
diff --git a/yarr/utils/rollout_generator.py b/yarr/utils/rollout_generator.py
index d4d2973..a3f12ee 100644
--- a/yarr/utils/rollout_generator.py
+++ b/yarr/utils/rollout_generator.py
@@ -27,7 +27,7 @@ class RolloutGenerator(object):
                                    deterministic=eval)
 
             # Convert to np if not already
-            agent_obs_elems = {k: np.array(v) for k, v in
+            agent_obs_elems = {k: np.array(v.cpu()) for k, v in
                                act_result.observation_elements.items()}
             extra_replay_elements = {k: np.array(v) for k, v in
                                      act_result.replay_elements.items()}
@@ -66,7 +66,7 @@ class RolloutGenerator(object):
                     prepped_data = {k: torch.tensor([v], device=self._env_device) for k, v in obs_history.items()}
                     act_result = agent.act(step_signal.value, prepped_data,
                                            deterministic=eval)
-                    agent_obs_elems_tp1 = {k: np.array(v) for k, v in
+                    agent_obs_elems_tp1 = {k: np.array(v.cpu()) for k, v in
                                            act_result.observation_elements.items()}
                     obs_tp1.update(agent_obs_elems_tp1)
                 replay_transition.final_observation = obs_tp1

(Side note: because of the recent changes to the RLBench folder structure, I also had to change the import for ActionMode.)

RLBench

git diff master

diff --git a/rlbench/action_modes/action_mode.py b/rlbench/action_modes/action_mode.py
index 68171a37..a2c264ef 100644
--- a/rlbench/action_modes/action_mode.py
+++ b/rlbench/action_modes/action_mode.py
@@ -29,8 +29,8 @@ class MoveArmThenGripper(ActionMode):
 
     def action(self, scene: Scene, action: np.ndarray):
         arm_act_size = np.prod(self.arm_action_mode.action_shape(scene))
-        arm_action = np.array(action[:arm_act_size])
-        ee_action = np.array(action[arm_act_size:])
+        arm_action = np.array(action[:arm_act_size].cpu())
+        ee_action = np.array(action[arm_act_size:].cpu())
         self.arm_action_mode.action(scene, arm_action)
         self.gripper_action_mode.action(scene, ee_action)

I believe the error actually comes from a change somewhere else, though, or perhaps you use a torch version that can deal with this? Can you please help me? I don't know which PyTorch version you are using; it is missing from the requirements.txt. I installed PyTorch with conda:
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
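
For reference, this is the minimal behaviour I am seeing (just a standalone sanity check, not code from ARM/YARR):

import numpy as np
import torch

t = torch.zeros(3, device='cuda')
np.array(t)        # raises: TypeError: can't convert cuda:0 device type tensor to numpy ...
np.array(t.cpu())  # works once the tensor is copied to host memory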

An error that I am unable to fix is this one:

Exception in thread EnvRunnerThread:
Traceback (most recent call last):
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/site-packages/yarr/runners/env_runner.py", line 141, in _run
    new_transitions = self._update()
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/site-packages/yarr/runners/env_runner.py", line 86, in _update
    self._agent_summaries = list(
  File "<string>", line 2, in __getitem__
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/multiprocessing/managers.py", line 825, in _callmethod
    raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError: 
---------------------------------------------------------------------------
Unserializable message: Traceback (most recent call last):
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/multiprocessing/managers.py", line 300, in serve_client
    send(msg)
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/multiprocessing/connection.py", line 211, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 249, in reduce_tensor
    event_sync_required) = storage._share_cuda_()
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/site-packages/torch/storage.py", line 623, in _share_cuda_
    return self._storage._share_cuda_(*args, **kwargs)
RuntimeError: Attempted to send CUDA tensor received from another process; this is not currently supported. Consider cloning before sending.

---------------------------------------------------------------------------
[W CudaIPCTypes.cpp:92] Producer process tried to deallocate over 1000 memory blocks referred by consumer processes. Deallocation might be significantly slowed down. We assume it will never going to be the case, but if it is, please file but to https://github.com/pytorch/pytorch

Do you have any advice? It seems to me that PyTorch is behind most of the problems I mentioned.

using: Python 3.9.12

@diegomaureira

Hi Alexander! I have the same problem. Were you able to fix it? I also get a "CUDA out of memory" error; does anyone know what the hardware requirements are?

@jianingq

I encountered the "RuntimeError: Attempted to send CUDA tensor received from another process; this is not currently supported. Consider cloning before sending." problem as well. I think this is because some of the items in the summaries are CUDA tensors. I solved it by converting them to numpy arrays.
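
Roughly what I did (the names here are only illustrative, not the actual YARR/ARM API) was to make sure anything that ends up in the summaries is already a CPU numpy array before it crosses a process boundary:

import numpy as np
import torch

def to_cpu_numpy(value):
    # CUDA tensors (especially ones received from another process) cannot be
    # re-sent through the multiprocessing manager, so copy them to host memory.
    if isinstance(value, torch.Tensor):
        return value.detach().cpu().numpy()
    return value

# e.g. applied to whatever the agent reports, before it is shared between processes:
# summaries = {k: to_cpu_numpy(v) for k, v in summaries.items()}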

@weixiang-smart

@jianingq Hi, I also ran into this problem. I would be grateful if you could provide details of your solution.

@jianingq

I think the error happens when new transitions are collected, so double-check the elements in the summaries and make sure they are on the CPU rather than the GPU. For example, I changed ARM/arm/custom_rlbench_env.py line 123 from action = act_result.action to action = act_result.action.cpu().numpy().

@weixiang-smart

Thank you so much! It helped me solve the problem.

@kevin-xuan

@weixiang-smart @jianingq, I modified action = act_result.action to action = act_result.action.cpu().numpy() but still encounter "RuntimeError: Attempted to send CUDA tensor received from another process; this is not currently supported. Consider cloning before sending." when the eval_envs process runs to collect new transitions. Is there any other code that needs to be modified? Or is something wrong with my PyTorch version (1.13.1)? I really appreciate any help you can provide.

@kevin-xuan

kevin-xuan commented Jan 3, 2023

The error happens when collecting new transitions, so it is not about the training process. Taking BCAgent as an example, I checked the code again and found that the error seems to be related to this line. I modified it to return ActResult(mu[0].cpu().detach().numpy()) instead of the solution above. If you want to run the QAttentionAgent of the C2FARM method, you could try modifying the corresponding line there.
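
As a sketch of what I mean (the real BCAgent is more involved; the ActResult import path is what I have locally, adjust if yours differs):

import torch
from yarr.agents.agent import ActResult

class BCAgentSketch:
    def __init__(self, actor_network):
        self._actor = actor_network

    def act(self, step, observation, deterministic=False):
        with torch.no_grad():
            mu = self._actor(observation)  # network output stays on the GPU
        # Hand back a host-memory numpy array so the env runner can pickle the
        # transition without trying to share a CUDA tensor between processes.
        return ActResult(mu[0].cpu().detach().numpy())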

@yananliusdu

I ran into this kind of issue too, even with '.cpu().numpy()'. Could there be somewhere else that I forgot to change?

TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

@kevin-xuan

@yananliusdu You can try the method above.
