[RLlib; Core; Tune]: Ray keeps crashing during tune run. #39726

Closed
grizzlybearg opened this issue Sep 18, 2023 · 3 comments
Labels: bug (Something that is supposed to be working; but isn't), @external-author-action-required (Alternate tag for PRs where the author doesn't have labeling permission), P1 (Issue that should be fixed within a few weeks), QS (Quantsight triage label), windows

grizzlybearg commented Sep 18, 2023

What happened + What you expected to happen

For the past few days, all training runs have been failing between 6 and 10 hours into training. I get this output:
```
(raylet) [2023-09-18 05:59:34,944 C 19680 8204] (raylet.exe) dlmalloc.cc:129: Check failed: *handle != nullptr CreateFileMapping() failed. GetLastError() = 1455
(raylet) *** StackTrace Information ***
(raylet) unknown
(raylet) unknown
(raylet) [... 22 more identical "unknown" stack frames elided ...]
(raylet) recalloc
(raylet) BaseThreadInitThunk
(raylet) RtlUserThreadStart
(raylet)
(RolloutWorker pid=21400) C:\arrow\cpp\src\arrow\filesystem\s3fs.cc:
(RolloutWorker pid=6816) C:\arrow\cpp\src\arrow\filesystem\s3fs.cc:2598: arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit
2023-09-18 05:59:38,368 ERROR tune_controller.py:1502 -- Trial task failed for trial PPO_CustomEnv-v0_ff24b_00001
Traceback (most recent call last):
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray\air\execution_internal\event_manager.py", line 110, in resolve_future
result = ray.get(future)
^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\worker.py", line 2562, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

Trial PPO_CustomEnv-v0_ff24b_00001 errored after 53 iterations at 2023-09-18 05:59:38. Total running time: 6hr 29min 28s
Error file: C:/Users/user/ray_results\Ndovu1\PPO_CustomEnv-v0_ff24b_00001_1_clip_param=0.1429,gamma=0.9634,kl_coeff=0.0016,kl_target=0.0028,lambda=0.9993,lr=0.0019,lr=0.0012_2023-09-17_23-30-10\error.txt
╭───────────────────────────────────────────────────────────╮
│ Trial PPO_CustomEnv-v0_ff24b_00001 result │
├───────────────────────────────────────────────────────────┤
│ episodes_total 15 │
│ evaluation/sampler_results/episode_reward_mean nan │
│ num_env_steps_sampled 15900 │
│ num_env_steps_trained 15900 │
│ sampler_results/episode_len_mean 1031 │
│ sampler_results/episode_reward_mean -13584.7 │
╰───────────────────────────────────────────────────────────╯
2023-09-18 05:59:38,462 ERROR tune_controller.py:1502 -- Trial task failed for trial PPO_CustomEnv-v0_ff24b_00000
Traceback (most recent call last):
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray\air\execution_internal\event_manager.py", line 110, in resolve_future
result = ray.get(future)
^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\worker.py", line 2562, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

Trial PPO_CustomEnv-v0_ff24b_00000 errored after 52 iterations at 2023-09-18 05:59:38. Total running time: 6hr 29min 28s
Error file: C:/Users/user/ray_results\Ndovu1\PPO_CustomEnv-v0_ff24b_00000_0_clip_param=0.0628,gamma=0.8613,kl_coeff=0.0089,kl_target=0.0021,lambda=0.9291,lr=0.0013,lr=0.0030_2023-09-17_23-30-10\error.txt
╭───────────────────────────────────────────────────────────╮
│ Trial PPO_CustomEnv-v0_ff24b_00000 result │
├───────────────────────────────────────────────────────────┤
│ episodes_total 15 │
│ evaluation/sampler_results/episode_reward_mean nan │
│ num_env_steps_sampled 15600 │
│ num_env_steps_trained 15600 │
│ sampler_results/episode_len_mean 1031 │
│ sampler_results/episode_reward_mean -17543.1 │
╰───────────────────────────────────────────────────────────╯
2023-09-18 05:59:49,841 WARNING worker.py:2071 -- The node with node id: 2628e1894464566f5f0e56ebf56cee56db24835db98d85153a0d0172 and address: 127.0.0.1 and node name: 127.0.0.1 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, preempted node, etc.)
(2) raylet has lagging heartbeats due to slow network or busy workload.
2023-09-18 05:59:49,857 ERROR tune_controller.py:1502 -- Trial task failed for trial PPO_CustomEnv-v0_ff24b_00000
Traceback (most recent call last):
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray\air\execution_internal\event_manager.py", line 110, in resolve_future
result = ray.get(future)
^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\worker.py", line 2562, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: PPO
actor_id: 38a34301b858e9124e3ad4f501000000
namespace: 821b5c5c-a045-4b82-b2c8-e052ba786a9c
The actor is dead because its owner has died. Owner Id: 01000000ffffffffffffffffffffffffffffffffffffffffffffffff Owner Ip address: 127.0.0.1 Owner worker exit type: SYSTEM_ERROR Worker exit detail: Owner's node has crashed.
The actor never ran - it was cancelled before it started running.
Trial PPO_CustomEnv-v0_ff24b_00000 errored after 52 iterations at 2023-09-18 05:59:49. Total running time: 6hr 29min 39s
Error file: C:/Users/user/ray_results\Ndovu1\PPO_CustomEnv-v0_ff24b_00000_0_clip_param=0.0628,gamma=0.8613,kl_coeff=0.0089,kl_target=0.0021,lambda=0.9291,lr=0.0013,lr=0.0030_2023-09-17_23-30-10\error.txt
2023-09-18 05:59:49,888 ERROR tune_controller.py:1502 -- Trial task failed for trial PPO_CustomEnv-v0_ff24b_00001
Traceback (most recent call last):
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray\air\execution_internal\event_manager.py", line 110, in resolve_future
result = ray.get(future)
^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\worker.py", line 2562, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: PPO
actor_id: aeca46f4aa7611558d9fd36a01000000
namespace: 821b5c5c-a045-4b82-b2c8-e052ba786a9c
The actor is dead because its node has died. Node Id: 2628e1894464566f5f0e56ebf56cee56db24835db98d85153a0d0172
The actor never ran - it was cancelled before it started running.

Trial PPO_CustomEnv-v0_ff24b_00001 errored after 53 iterations at 2023-09-18 05:59:49. Total running time: 6hr 29min 39s
Error file: C:/Users/user/ray_results\Ndovu1\PPO_CustomEnv-v0_ff24b_00001_1_clip_param=0.1429,gamma=0.9634,kl_coeff=0.0016,kl_target=0.0028,lambda=0.9993,lr=0.0019,lr=0.0012_2023-09-17_23-30-10\error.txt
2023-09-18 05:59:50,249 WARNING resource_updater.py:262 -- Cluster resources not detected or are 0. Attempt #2..
```
I took note of the resource usage for all the experiments I've run, and the resources are not stretched:
[image: resource-usage monitoring screenshot]
I'm using Ray for Python 3.11 on Windows. First, I tried the stable 2.6.3 release and then the nightly release; both versions have the same outcome. I also tried training on cloud VMs and the outcome is the same. What could be the issue?

Versions / Dependencies

Windows 11 and Ubuntu 22
Ray 2.6.3 and Ray Nightly
Python 3.11

Reproduction script

```python
    self.exp_name = "Ndovu"
    args = "PPO"
    self.hp_ranges = HPRanges()
    self.trainerHPs = PPOLearnerHPs(params=self.hp_ranges).config
    self.algo = PPOConfig()
    self.trainer = args

    # Training
    self.framework = "torch"
    self.preprocessor_pref = "rllib"
    self.observation_filter = "MeanStdFilter"
    self.train_batch_size = self.hp_ranges.train_batch_size
    self.max_seq_len = 20

    # Rollouts
    self.max_iterations = 500
    self.num_rollout_workers = 1
    self.rollout_fragment_length = round(self.train_batch_size / 3)
    self.batch_mode = "truncate_episodes"
    self.create_env_on_local_worker = (
        True if self.num_rollout_workers == 0 else False
    )
    self.num_envs_per_worker = 1
    self.remote_worker_envs = False
    # Remote envs only make sense if num_envs > 1 (i.e. environment vectorization is enabled) or the env takes long to step through, but they add additional overhead.

    # Resources
    self.num_learner_workers = 1

    self.num_cpus_per_worker = 1
    self.num_cpus_for_local_worker = 1
    self.num_cpus_per_learner_worker = 1

    self.num_gpus = 0
    self.num_gpus_per_worker = 0
    self.num_gpus_per_learner_worker = 0
    self._fake_gpus = False

    self.custom_resources_per_worker = None

    self.placement_strategy = "SPREAD"

    # Evaluation
    self.evaluation_parallel_to_training = False
    self.evaluation_num_workers = 1
    self.evaluation_duration_unit = "episodes"
    self.evaluation_duration = 1
    self.evaluation_frequency = round(self.max_iterations / 1)

    # Exploration
    self.random_steps = 250

    # Tuner
    self.num_samples = 2
    self.max_concurrent_trials = 2
    self.time_budget_s = None

    # Logging
    self.sep = "/"  # ray is sensitive with file names on windows
    self.dir = f"C:{self.sep}Users{self.sep}user{self.sep}ray_results"
    self.log_dir, self.log_name = self._log_dir(
        Path(self.dir), self.exp_name, self.sep
    )
    ts = f"trial_summaries{self.sep}{self.exp_name}"
    self.summaries_dir = Path(self.dir).joinpath(ts)

    if not self.summaries_dir.exists():
        self.summaries_dir.mkdir(parents=True)

    print(f"Log Name: {self.log_name}")
    print(f"Log directory: {self.log_dir.as_posix()}")
    print(f"Summaries directory {self.summaries_dir.as_posix()}")

    # Metrics
    self.metrics = "episode_reward_mean"  
    self.mode = "max"

    # Checkpoints, sync & perturbs
    self.score = self.metrics
    self.checkpoint_frequency = (
        round(self.max_iterations / 10)
        if self.max_iterations <= 200
        else round(self.max_iterations / 33.33)
    )
    self.pertub_frequency = self.checkpoint_frequency
    self.pertub_burn_period = self.checkpoint_frequency * 2
    self.num_to_keep = 3

    # Others
    self.verbose = 1

    # Register Model
    self.model_3 = {
        "max_seq_len": 10,
        "use_lstm": True,
        "lstm_cell_size": 2048,
        "lstm_use_prev_action": True,
        "lstm_use_prev_reward": True,
        "fcnet_hiddens": [1024, 2048, 2048, 4096, 4096, 2048, 2048, 1024, 1024],
        "post_fcnet_hiddens": [1024, 1024, 1024, 1024],
        "fcnet_activation": "swish",
        "post_fcnet_activation": "swish",
        "vf_share_layers": True,
        "no_final_linear": True,
    }  # 166,043,706 params


    self.exploration_config = {
        "type": "StochasticSampling",  # Default for PG algorithms
        # StochasticSampling can be made deterministic by passing explore=False into the call to `get_exploration_action`. Also allows for scheduled parameters for the distributions, such as lowering stddev, temperature, etc.. over time.
        "random_timesteps": self.random_steps,  # Int
        "framework": self.framework,
    }

    self.centered_adam = False
    self.optimizer_config = {
        "type": "RAdam",
        "lr": self.hp_ranges.lr,
        "betas": (0.9, 0.999),
        #'beta2': 0.999, # Only used if centered=False.
        "eps": 1e-08,  # Only used if centered=False.
        # 'weight_decay': <float>,
        #'amsgrad': False # Only used if centered=False.
    }

    # ENV
    self.env = env_name
    self.render_env = False
    self.evaluation_config_ = self.algo.overrides(  # type: ignore
        explore=False, render_env=False
    )

    self.config = (
        self.algo.update_from_dict(config_dict=self.trainerHPs.to_dict())
        .environment(
            env=self.env,
            env_config=self.train_env_config,
            # env=CartPoleEnv, #testing
            render_env=self.render_env,
            clip_rewards=None,
            auto_wrap_old_gym_envs=False,
            disable_env_checking=True,
            is_atari=False,
        )
        .framework(
            framework="torch",
            torch_compile_learner=True,  # For enabling torch-compile during training
            torch_compile_learner_dynamo_backend="ipex",
            torch_compile_learner_dynamo_mode="default",
            torch_compile_worker=True,  # For enabling torch-compile during sampling
            torch_compile_worker_dynamo_backend="ipex",
            torch_compile_worker_dynamo_mode="default",
        )
        .debugging(log_level="ERROR", log_sys_usage=True)  # type: ignore
        .rollouts(
            num_rollout_workers=self.num_rollout_workers,
            num_envs_per_worker=self.num_envs_per_worker,
            create_env_on_local_worker=self.create_env_on_local_worker,
            enable_connectors=True,
            rollout_fragment_length=self.rollout_fragment_length,
            batch_mode=self.batch_mode,
            # remote_worker_envs=self.remote_worker_envs,
            # remote_env_batch_wait_ms=0,
            validate_workers_after_construction=True,
            preprocessor_pref=self.preprocessor_pref,
            observation_filter=self.observation_filter,  # TODO: Test NoFilter
            update_worker_filter_stats=True,
            compress_observations=False,  # TODO: Test True
        )
        .fault_tolerance(
            recreate_failed_workers=True,
            max_num_worker_restarts=10,
            delay_between_worker_restarts_s=30,
            restart_failed_sub_environments=True,
            num_consecutive_worker_failures_tolerance=10,
            worker_health_probe_timeout_s=300,
            worker_restore_timeout_s=180,
        )
        .resources(
            num_cpus_per_worker=self.num_cpus_per_worker,
            # num_gpus_per_worker= self.num_gpus_per_worker,
            num_cpus_for_local_worker=self.num_cpus_for_local_worker,
            num_learner_workers=self.num_learner_workers,
            num_cpus_per_learner_worker=self.num_cpus_per_learner_worker,
            placement_strategy=self.placement_strategy,
        )
        .exploration(explore=True, exploration_config=self.exploration_config)
        .checkpointing(
            export_native_model_files=False,
            checkpoint_trainable_policies_only=False,
        )  # Bool
        .evaluation(
            evaluation_interval=self.evaluation_frequency,
            evaluation_duration=self.evaluation_duration,
            evaluation_duration_unit=self.evaluation_duration_unit,
            evaluation_sample_timeout_s=600,
            evaluation_parallel_to_training=self.evaluation_parallel_to_training,
            # evaluation_config = self.evaluation_config_,
            # off_policy_estimation_methods = {}, # See Notes in Next Cell
            # ope_split_batch_by_episode = True, # default
            evaluation_num_workers=self.evaluation_num_workers,
            # custom_evaluation_function = None,
            always_attach_evaluation_results=True,
            enable_async_evaluation=True
            if self.evaluation_num_workers > 1
            else False,
        )
        .callbacks(MyCallbacks)
        .rl_module(
            _enable_rl_module_api=False,
            # rl_module_spec=module_to_load_spec
        )
        .training(
            # gamma=0.98,  # ,
            # lr=1e-5,
            gamma=self.hp_ranges.gamma,  # type: ignore
            lr=self.hp_ranges.lr,  # type: ignore
            grad_clip_by="norm",  # type: ignore
            grad_clip=0.3,
            train_batch_size=self.hp_ranges.train_batch_size,  # type: ignore
            model=self.model_3,  # type: ignore
            optimizer=self.optimizer_config,
            _enable_learner_api=False,
            # learner_class=None
        )
    )
    self.config_dict = self.config.to_dict()

    self.stopper = CombinedStopper(
        MaximumIterationStopper(max_iter=self.max_iterations),
        TrialPlateauStopper(
            metric=self.metrics,
            std=0.04,
            num_results=10,
            grace_period=100,
            metric_threshold=200,
            mode="max",
        ),
    )

    self.checkpointer = CheckpointConfig(
        num_to_keep=self.num_to_keep,
        checkpoint_score_attribute=self.score,
        checkpoint_score_order=self.mode,
        checkpoint_frequency=self.checkpoint_frequency,
        checkpoint_at_end=True,
    )

    self.failure_check = FailureConfig(max_failures=5, fail_fast=False)

    self.sync_config = SyncConfig(
        # syncer=None,
        sync_period=7200,
        sync_timeout=7200,
        sync_artifacts=True,
        sync_artifacts_on_checkpoint=True,
    )

    hyper_dict = {
        # distribution for resampling
        "gamma": self.hp_ranges.gamma,
        "lr": self.hp_ranges.lr,
        "vf_loss_coeff": self.hp_ranges.vf_loss_coeff,
        "kl_coeff": self.hp_ranges.kl_coeff,
        "kl_target": self.hp_ranges.kl_target,
        "lambda_": self.hp_ranges.lambda_,
        "clip_param": self.hp_ranges.clip_param,
        "grad_clip": self.hp_ranges.grad_clip,
    }

    self.pbt_scheduler = PopulationBasedTraining(
        time_attr="training_iteration",
        perturbation_interval=self.pertub_frequency,
        burn_in_period=self.pertub_burn_period,
        hyperparam_mutations=hyper_dict,  # type:ignore
        quantile_fraction=0.50,  # Paper default
        resample_probability=0.20,
        perturbation_factors=(1.2, 0.8),  # Paper default
        # custom_explore_fn = None
    )
```

Issue Severity

High: It blocks me from completing my task.

grizzlybearg added the bug and triage labels on Sep 18, 2023
sven1977 (Contributor) commented

Hey @grizzlybearg, thanks for raising this issue. Could you try to boil down your reproduction script to a manageable/debuggable size? Then we might be able to better assist. Possible questions a debugger would have:

  • Does this error occur on a local (laptop) setup?
  • Does it happen for a simpler built-in algo, like PPO?
  • With a simpler setup: no eval workers, only one or zero remote workers (num_workers=0), etc.
  • Without Ray Tune, or at least with a simpler setup (no checkpoints, no PBT, no hyperparameter tuning, etc.); a stripped-down sketch is shown below for illustration.
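
For illustration, a boiled-down script along those lines might look roughly like the sketch below (it assumes CartPole-v1 as a stand-in for the custom environment and drops Tune, the custom model, and the evaluation workers):

```python
# Hypothetical minimal repro sketch (assumptions: CartPole-v1 stands in for
# the custom env; Ray >= 2.6 RLlib API). No Tune, no eval workers, no custom
# model; sampling happens on the local worker only.
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .rollouts(num_rollout_workers=0)   # local worker only
    .training(train_batch_size=4000)
)

algo = config.build()
for i in range(50):
    result = algo.train()
    print(i, result["episode_reward_mean"])
algo.stop()
```

If the raylet still crashes with the same dlmalloc/CreateFileMapping error under a setup like this, that would point away from the custom environment and the Tune/PBT machinery.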

sven1977 changed the title from "[Rllib, Core]: Ray keeps crashing during tune" to "[RLlib; Core; Tune]: Ray keeps crashing during tune run." on Sep 21, 2023
rkooo567 added the core, windows, QS, P2, and P1 labels and removed the triage, P2, and core labels on Sep 25, 2023
mattip (Contributor) commented Sep 26, 2023

I think the error indicates an out-of-memory problem when calling CreateFileMapping (error code 1455 indicates "ERROR_COMMITMENT_LIMIT: The paging file is too small for this operation to complete"). Is there a reason you use Windows in a cloud cluster? In general, Linux is preferred since it is better tested and usually costs less.

Your monitoring seems to show a memory spike around 5.5 hours, and usage goes over 100% a bit later. Maybe that is connected to the out-of-memory error?

You start your report with "For the past few days...". Was it working differently before that?
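
One possible mitigation consistent with that reading (a sketch, assuming the shared-memory object store is what backs the failed CreateFileMapping call): cap the object store explicitly when starting Ray so the mapping stays within the commit limit. object_store_memory is a standard ray.init() argument; the 2 GiB figure is only an example value.

```python
import ray

# Sketch: cap the shared-memory object store (backed on Windows by a file
# mapping) so the CreateFileMapping request stays within the commit limit.
# 2 GiB is an example value; size it to the machine's page-file headroom.
ray.init(object_store_memory=2 * 1024**3)
```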

anyscalesam added the @external-author-action-required label on Oct 25, 2023
anyscalesam (Contributor) commented Nov 8, 2023

Re-reviewing this with @mattip, it sounds like it's a memory issue (we have some hypotheses on why, based on a comparison between Windows and Linux).

@grizzlybearg, can you try doubling the memory and see if it passes or progresses further? Also, can you answer the clarifying questions from @mattip above?
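
A quick way to see whether resizing the page file actually changed the available headroom (a sketch; it assumes psutil is installed) is to print the physical and swap usage before re-running the experiment:

```python
import psutil

# Sketch: report physical RAM and page-file (swap) usage so the effect of
# doubling the page file is visible before re-running the experiment.
vm = psutil.virtual_memory()
sw = psutil.swap_memory()
print(f"RAM  used/total: {vm.used / 1e9:.1f} / {vm.total / 1e9:.1f} GB")
print(f"Swap used/total: {sw.used / 1e9:.1f} / {sw.total / 1e9:.1f} GB")
```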
