[RLlib; Core; Tune]: Ray keeps crashing during tune run. #39726
Labels
bug: Something that is supposed to be working; but isn't
@external-author-action-required: Alternate tag for PRs where the author doesn't have labeling permission
P1: Issue that should be fixed within a few weeks
QS: Quansight triage label
windows
What happened + What you expected to happen
For the past few days, all training runs have been failing between 6 and 10 hours into training. I get this output:
```
(raylet) [2023-09-18 05:59:34,944 C 19680 8204] (raylet.exe) dlmalloc.cc:129: Check failed: *handle != nullptr CreateFileMapping() failed. GetLastError() = 1455
(raylet) *** StackTrace Information ***
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) recalloc
(raylet) BaseThreadInitThunk
(raylet) RtlUserThreadStart
(raylet)
(RolloutWorker pid=21400) C:\arrow\cpp\src\arrow\filesystem\s3fs.cc:
(RolloutWorker pid=6816) C:\arrow\cpp\src\arrow\filesystem\s3fs.cc:2598: arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit
2023-09-18 05:59:38,368 ERROR tune_controller.py:1502 -- Trial task failed for trial PPO_CustomEnv-v0_ff24b_00001
Traceback (most recent call last):
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray\air\execution_internal\event_manager.py", line 110, in resolve_future
result = ray.get(future)
^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\worker.py", line 2562, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
Trial PPO_CustomEnv-v0_ff24b_00001 errored after 53 iterations at 2023-09-18 05:59:38. Total running time: 6hr 29min 28s
Error file: C:/Users/user/ray_results\Ndovu1\PPO_CustomEnv-v0_ff24b_00001_1_clip_param=0.1429,gamma=0.9634,kl_coeff=0.0016,kl_target=0.0028,lambda=0.9993,lr=0.0019,lr=0.0012_2023-09-17_23-30-10\error.txt
╭───────────────────────────────────────────────────────────╮
│ Trial PPO_CustomEnv-v0_ff24b_00001 result │
├───────────────────────────────────────────────────────────┤
│ episodes_total 15 │
│ evaluation/sampler_results/episode_reward_mean nan │
│ num_env_steps_sampled 15900 │
│ num_env_steps_trained 15900 │
│ sampler_results/episode_len_mean 1031 │
│ sampler_results/episode_reward_mean -13584.7 │
╰───────────────────────────────────────────────────────────╯
2023-09-18 05:59:38,462 ERROR tune_controller.py:1502 -- Trial task failed for trial PPO_CustomEnv-v0_ff24b_00000
Traceback (most recent call last):
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray\air\execution_internal\event_manager.py", line 110, in resolve_future
result = ray.get(future)
^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\worker.py", line 2562, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
Trial PPO_CustomEnv-v0_ff24b_00000 errored after 52 iterations at 2023-09-18 05:59:38. Total running time: 6hr 29min 28s
Error file: C:/Users/user/ray_results\Ndovu1\PPO_CustomEnv-v0_ff24b_00000_0_clip_param=0.0628,gamma=0.8613,kl_coeff=0.0089,kl_target=0.0021,lambda=0.9291,lr=0.0013,lr=0.0030_2023-09-17_23-30-10\error.txt
╭───────────────────────────────────────────────────────────╮
│ Trial PPO_CustomEnv-v0_ff24b_00000 result │
├───────────────────────────────────────────────────────────┤
│ episodes_total 15 │
│ evaluation/sampler_results/episode_reward_mean nan │
│ num_env_steps_sampled 15600 │
│ num_env_steps_trained 15600 │
│ sampler_results/episode_len_mean 1031 │
│ sampler_results/episode_reward_mean -17543.1 │
╰───────────────────────────────────────────────────────────╯
2023-09-18 05:59:49,841 WARNING worker.py:2071 -- The node with node id: 2628e1894464566f5f0e56ebf56cee56db24835db98d85153a0d0172 and address: 127.0.0.1 and node name: 127.0.0.1 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, preempted node, etc.)
(2) raylet has lagging heartbeats due to slow network or busy workload.
2023-09-18 05:59:49,857 ERROR tune_controller.py:1502 -- Trial task failed for trial PPO_CustomEnv-v0_ff24b_00000
Traceback (most recent call last):
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray\air\execution_internal\event_manager.py", line 110, in resolve_future
result = ray.get(future)
^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\worker.py", line 2562, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: PPO
actor_id: 38a34301b858e9124e3ad4f501000000
namespace: 821b5c5c-a045-4b82-b2c8-e052ba786a9c
The actor is dead because its owner has died. Owner Id: 01000000ffffffffffffffffffffffffffffffffffffffffffffffff Owner Ip address: 127.0.0.1 Owner worker exit type: SYSTEM_ERROR Worker exit detail: Owner's node has crashed.
The actor never ran - it was cancelled before it started running.
Trial PPO_CustomEnv-v0_ff24b_00000 errored after 52 iterations at 2023-09-18 05:59:49. Total running time: 6hr 29min 39s
Error file: C:/Users/user/ray_results\Ndovu1\PPO_CustomEnv-v0_ff24b_00000_0_clip_param=0.0628,gamma=0.8613,kl_coeff=0.0089,kl_target=0.0021,lambda=0.9291,lr=0.0013,lr=0.0030_2023-09-17_23-30-10\error.txt
2023-09-18 05:59:49,888 ERROR tune_controller.py:1502 -- Trial task failed for trial PPO_CustomEnv-v0_ff24b_00001
Traceback (most recent call last):
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray\air\execution_internal\event_manager.py", line 110, in resolve_future
result = ray.get(future)
^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\worker.py", line 2562, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: PPO
actor_id: aeca46f4aa7611558d9fd36a01000000
namespace: 821b5c5c-a045-4b82-b2c8-e052ba786a9c
The actor is dead because its node has died. Node Id: 2628e1894464566f5f0e56ebf56cee56db24835db98d85153a0d0172
The actor never ran - it was cancelled before it started running.
Trial PPO_CustomEnv-v0_ff24b_00001 errored after 53 iterations at 2023-09-18 05:59:49. Total running time: 6hr 29min 39s
Error file: C:/Users/user/ray_results\Ndovu1\PPO_CustomEnv-v0_ff24b_00001_1_clip_param=0.1429,gamma=0.9634,kl_coeff=0.0016,kl_target=0.0028,lambda=0.9993,lr=0.0019,lr=0.0012_2023-09-17_23-30-10\error.txt
2023-09-18 05:59:50,249 WARNING resource_updater.py:262 -- Cluster resources not detected or are 0. Attempt #2..
```
I took note of the resource usage for all the experiments I've run, and the resources are not stretched.
I'm using Ray for Python 3.11 on Windows. First I tried the stable 2.6.3 release and then the nightly release; both versions have the same outcome. I also tried training on cloud VMs and the outcome is the same. What could be the issue?
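For what it's worth, GetLastError() = 1455 is ERROR_COMMITMENT_LIMIT ("The paging file is too small for this operation to complete"), and the failing CreateFileMapping() call appears to come from the raylet's object-store allocator (dlmalloc.cc). The sketch below only shows how the object store could be capped explicitly when starting Ray, in case that is relevant; the value is illustrative and not something the failing runs necessarily used:

```python
# Hypothetical sketch only: start Ray with an explicitly capped object store, in case
# the default-sized CreateFileMapping() cannot be backed by the Windows paging file.
# object_store_memory is a real ray.init() argument; the 2 GiB value is illustrative.
import ray

ray.init(
    object_store_memory=2 * 1024**3,  # cap the plasma object store at ~2 GiB
    include_dashboard=False,          # optional: keeps the repro lighter
)
```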
Versions / Dependencies
Windows 11 and Ubuntu 22
Ray 2.6.3 and Ray Nightly
Python 3.11
Reproduction script
```python
# Fragment of the setup code (from a larger class, hence the `self.` attributes).
# HPRanges and PPOLearnerHPs are my own helper classes (not shown) that hold the
# hyperparameter ranges.
from ray.rllib.algorithms.ppo import PPOConfig

self.exp_name = "Ndovu"
args = "PPO"
self.hp_ranges = HPRanges()
self.trainerHPs = PPOLearnerHPs(params=self.hp_ranges).config
self.algo = PPOConfig()
self.trainer = args
hyper_dict = {
    # distribution for resampling
    "gamma": self.hp_ranges.gamma,
    "lr": self.hp_ranges.lr,
    "vf_loss_coeff": self.hp_ranges.vf_loss_coeff,
    "kl_coeff": self.hp_ranges.kl_coeff,
    "kl_target": self.hp_ranges.kl_target,
    "lambda_": self.hp_ranges.lambda_,
    "clip_param": self.hp_ranges.clip_param,
    "grad_clip": self.hp_ranges.grad_clip,
}
```
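For completeness, here is a minimal sketch of how the pieces above are wired into Tune. Names and ranges are illustrative: "CustomEnv-v0" is my custom environment (registered elsewhere), the real distributions come from HPRanges, and the PBT scheduler reflects the "distribution for resampling" comment and the sampled hyperparameters in the trial directory names:

```python
# Minimal sketch of the Tune launch (illustrative ranges standing in for HPRanges;
# "CustomEnv-v0" is a custom environment registered elsewhere).
from ray import air, tune
from ray.rllib.algorithms.ppo import PPOConfig
from ray.tune.schedulers import PopulationBasedTraining

search_space = {
    "gamma": tune.uniform(0.85, 0.999),
    "lr": tune.loguniform(1e-4, 5e-3),
    "kl_coeff": tune.loguniform(1e-3, 1e-1),
    "kl_target": tune.loguniform(1e-3, 1e-2),
    "lambda": tune.uniform(0.9, 1.0),
    "clip_param": tune.uniform(0.05, 0.3),
}

config = (
    PPOConfig()
    .environment("CustomEnv-v0")      # custom env, registration not shown here
    .rollouts(num_rollout_workers=1)  # worker count per trial is illustrative
)

# PBT resamples/perturbs the same hyperparameters during training.
pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    hyperparam_mutations=search_space,
)

tuner = tune.Tuner(
    "PPO",
    param_space={**config.to_dict(), **search_space},
    tune_config=tune.TuneConfig(
        metric="episode_reward_mean",
        mode="max",
        scheduler=pbt,
        num_samples=2,                # two concurrent trials, as in the log above
    ),
    run_config=air.RunConfig(name="Ndovu1"),
)
results = tuner.fit()
```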
Issue Severity
High: It blocks me from completing my task.