[rllib] Memory leak in environment worker in multi-agent setup #9964

Closed · 2 tasks done
sergeivolodin opened this issue Aug 6, 2020 · 14 comments · Fixed by #15815
Labels: bug (Something that is supposed to be working; but isn't), P2 (Important issue, but not time-critical)

@sergeivolodin

What is the problem?

When training in a multi-agent environment with multiple environment workers, the memory usage of the workers grows constantly and is not released after policy updates.

[screenshot]

== Status ==
Memory usage on this node: 5.2/8.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/10 CPUs, 0/0 GPUs, 0.0/5.03 GiB heap, 0.0/1.71 GiB objects
Result logdir: /home/sergei/ray_results/dummy_run
Number of trials: 1 (1 ERROR)
+------------------------------------+----------+-------+--------+------------------+---------+----------+
| Trial name                         | status   | loc   |   iter |   total time (s) |      ts |   reward |
|------------------------------------+----------+-------+--------+------------------+---------+----------|
| PPO_DummyMultiAgentEnv_fbe89_00000 | ERROR    |       |     57 |          4463.08 | 1881000 |        0 |
+------------------------------------+----------+-------+--------+------------------+---------+----------+
Number of errored trials: 1
+------------------------------------+--------------+---------------------------------------------------------------------------------------------------+
| Trial name                         |   # failures | error file                                                                                        |
|------------------------------------+--------------+---------------------------------------------------------------------------------------------------|
| PPO_DummyMultiAgentEnv_fbe89_00000 |            1 | /home/sergei/ray_results/dummy_run/PPO_DummyMultiAgentEnv_0_2020-08-05_16-21-42pm5cl_th/error.txt |
+------------------------------------+--------------+---------------------------------------------------------------------------------------------------+
ray.exceptions.RayTaskError(RayOutOfMemoryError): ray::RolloutWorker.par_iter_next() (pid=7604, ip=192.168.175.153)
  File "python/ray/_raylet.pyx", line 408, in ray._raylet.execute_task
  File "/home/sergei/miniconda3/envs/fresh_ray/lib/python3.8/site-packages/ray/memory_monitor.py", line 137, in raise_if_low_memory
    raise RayOutOfMemoryError(
ray.memory_monitor.RayOutOfMemoryError: Heap memory usage for ray_RolloutWorker_7604 is 0.4395 / 0.4395 GiB limit

If no memory limit is set, the processes run out of system memory and are killed. If the memory_per_worker limit is set, they go past the limit and are killed.

Ray version and other system information (Python version, TensorFlow version, OS):

Ray version: ray==0.8.6
Python version: Python 3.8.5 (default, Aug 5 2020, 08:36:46)
TensorFlow version: tensorflow==2.3.0
OS: Ubuntu 18.04.4 LTS (GNU/Linux 4.19.121-microsoft-standard x86_64) (same behavior outside WSL as well)

Reproduction

Run this and measure memory consumption. If you remove memory_per_worker limits, it will take longer, as workers will try to consume all system memory.

import numpy as np
import ray
import tensorflow as tf
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.agents.ppo.ppo_tf_policy import PPOTFPolicy
from ray.rllib.env.multi_agent_env import MultiAgentEnv
from gym.spaces import Box
from ray.tune.registry import register_env


def dim_to_gym_box(dim, val=np.inf):
    """Create gym.Box with specified dimension."""
    high = np.full((dim,), fill_value=val)
    return Box(low=-high, high=high)


class DummyMultiAgentEnv(MultiAgentEnv):
    """Return zero observations."""

    def __init__(self, config):
        del config  # Unused
        super(DummyMultiAgentEnv, self).__init__()
        self.config = dict(act_dim=17, obs_dim=380, n_players=2, n_steps=1000)
        self.players = ["player_%d" % p for p in range(self.config['n_players'])]
        self.current_step = 0

    def _obs(self):
        return np.zeros((self.config['obs_dim'],))

    def reset(self):
        self.current_step = 0
        return {p: self._obs() for p in self.players}

    def step(self, action_dict):
        done = self.current_step >= self.config['n_steps']
        self.current_step += 1

        obs = {p: self._obs() for p in self.players}
        rew = {p: 0.0 for p in self.players}
        dones = {p: done for p in self.players + ["__all__"]}
        infos = {p: {} for p in self.players}

        return obs, rew, dones, infos

    @property
    def observation_space(self):
        return dim_to_gym_box(self.config['obs_dim'])

    @property
    def action_space(self):
        return dim_to_gym_box(self.config['act_dim'])


def create_env(config):
    """Create the dummy environment."""
    return DummyMultiAgentEnv(config)


env_name = "DummyMultiAgentEnv"
register_env(env_name, create_env)


def get_trainer_config(env_config, train_policies, num_workers=5, framework="tfe"):
    """Build configuration for 1 run."""

    # obtaining parameters from the environment
    env = create_env(env_config)
    act_space = env.action_space
    obs_space = env.observation_space
    players = env.players
    del env

    def get_policy_config(p):
        """Get policy configuration for a player."""
        return (PPOTFPolicy, obs_space, act_space, {
            "model": {
                "use_lstm": False,
                "fcnet_hiddens": [64, 64],
            },
            "framework": framework,
        })

    # creating policies
    policies = {p: get_policy_config(p) for p in players}

    # trainer config
    config = {
        "env": env_name, "env_config": env_config, "num_workers": num_workers,
        "multiagent": {"policy_mapping_fn": lambda x: x, "policies": policies,
                       "policies_to_train": train_policies},
        "framework": framework,
        "train_batch_size": 32768,
        "num_sgd_iter": 1,
        "sgd_minibatch_size": 32768,

        # 450 megabytes for each worker (enough for 1 iteration)
        "memory_per_worker": 450 * 1024 * 1024,
        "object_store_memory_per_worker": 128 * 1024 * 1024,
    }
    return config


def tune_run():
    ray.init(ignore_reinit_error=True)
    config = get_trainer_config(train_policies=['player_1', 'player_2'], env_config={})
    return tune.run("PPO", config=config, verbose=True, name="dummy_run", num_samples=1)


if __name__ == '__main__':
    tune_run()
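
To watch the growth from outside, here is a minimal sketch that polls the rollout workers' resident memory with psutil (the "ray::RolloutWorker" name match is an assumption about how Ray titles its worker processes; adjust as needed):

import time
import psutil

while True:
    total_rss = 0
    for proc in psutil.process_iter(["cmdline", "memory_info"]):
        # match rollout worker processes by their command line / process title
        cmd = " ".join(proc.info["cmdline"] or [])
        mem = proc.info["memory_info"]
        if mem is not None and "ray::RolloutWorker" in cmd:
            total_rss += mem.rss
    # combined resident memory of all rollout workers, in GiB
    print("rollout workers RSS: %.2f GiB" % (total_rss / 1024 ** 3))
    time.sleep(60)
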
  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.
@sergeivolodin sergeivolodin added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Aug 6, 2020
@richardliaw
Contributor

@sergeivolodin How high does it go? Also, can you try different Tensorflow versions?

@richardliaw richardliaw added P2 Important issue, but not time-critical rllib and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Aug 7, 2020
@sergeivolodin
Author

@richardliaw Here it ate all the remaining memory on my laptop and crashed WSL. Which versions of TensorFlow do you want me to try?

[screenshot]

@jyericlin

jyericlin commented Aug 20, 2020

We also observed the same issue running QMIX with PyTorch on SMAC. In my case, the machine has 16 GB of RAM, and the training would eventually consume all of it and crash (at around iteration 600 using the 2s3z map).

Tried setting object_store_memory=1*10**9, redis_max_memory=1*10**9 for ray.init, but it didn't help.
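
Roughly (a sketch of the call, not the exact code):

ray.init(object_store_memory=1 * 10**9, redis_max_memory=1 * 10**9)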

Ray version: 0.8.7
OS: Ubuntu 16.04
PyTorch version: 1.3.0
command used: python run_qmix.py --num-iters=1000 --num-workers=7 --map-name=2s3z (using the example from SMAC repository)

Error message:

Failure # 1 (occurred at 2020-08-20_13-12-55)
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/raydev/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 471, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/ubuntu/anaconda3/envs/raydev/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 430, in fetch_result
    result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
  File "/home/ubuntu/anaconda3/envs/raydev/lib/python3.6/site-packages/ray/worker.py", line 1538, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RayOutOfMemoryError): ^[[36mray::QMIX.train()^[[39m (pid=949, ip=10.11.0.13)
  File "python/ray/_raylet.pyx", line 440, in ray._raylet.execute_task
  File "/home/ubuntu/anaconda3/envs/raydev/lib/python3.6/site-packages/ray/memory_monitor.py", line 128, in raise_if_low_memory
    self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node rltb-1 is used (14.9 / 15.67 GB). The top 10 memory consumers are:

PID MEM COMMAND
949 6.25GiB ray::QMIX
4628    0.5GiB  /home/ubuntu/StarCraftII/Versions/Base69232/SC2_x64 -listen 127.0.0.1 -port 24827 -dataDir /home/ubu
4631    0.5GiB  /home/ubuntu/StarCraftII/Versions/Base69232/SC2_x64 -listen 127.0.0.1 -port 24013 -dataDir /home/ubu
4637    0.5GiB  /home/ubuntu/StarCraftII/Versions/Base69232/SC2_x64 -listen 127.0.0.1 -port 20642 -dataDir /home/ubu
4625    0.5GiB  /home/ubuntu/StarCraftII/Versions/Base69232/SC2_x64 -listen 127.0.0.1 -port 20480 -dataDir /home/ubu
4624    0.5GiB  /home/ubuntu/StarCraftII/Versions/Base69232/SC2_x64 -listen 127.0.0.1 -port 19296 -dataDir /home/ubu
4626    0.5GiB  /home/ubuntu/StarCraftII/Versions/Base69232/SC2_x64 -listen 127.0.0.1 -port 21827 -dataDir /home/ubu
4627    0.5GiB  /home/ubuntu/StarCraftII/Versions/Base69232/SC2_x64 -listen 127.0.0.1 -port 17905 -dataDir /home/ubu

@sergeivolodin
Author

sergeivolodin commented Sep 1, 2020

@jyericlin as a workaround, we just create a separate process for every training step, which loads a checkpoint, does one iteration, and then saves a checkpoint (pseudocode below, fully functional example here). The extra time spent on process creation and checkpointing does not seem too bad (for our case!)

while True:
  config_filename = pickle_config(config)

  # starts a new python process from bash
  # important, can't just fork
  # because of tensorflow+fork import issues
  start_process_and_wait(target=train, config_filename)
  results = unpickle_results(config)
  delete_temporary_files()
  if results['iteration'] > N: break
  config['checkpoint'] = results['checkpoint']

def train(config_filename):
  config = unpickle_config(config_filename)
  trainer = PPO(config)
  trainer.restore(config['checkpoint'])
  results = trainer.train()
  checkpoint = trainer.save()
  results['checkpoint'] = checkpoint
  pickle_results(results)

  # important -- otherwise memory will go up!
  trainer.stop()

Note that in train we also need to reconnect to the existing Ray instance to reuse the worker processes.
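
A more concrete sketch of that loop (build_config is a placeholder for the config-building code from the repro above; it assumes a Ray cluster was started beforehand with ray start --head so that both the parent and the child process can attach to it):

import multiprocessing
import os
import pickle
import tempfile

import ray
from ray.rllib.agents.ppo import PPOTrainer


def train_one_iteration(checkpoint, result_path):
    """Run exactly one training iteration in a fresh process, then exit."""
    # Reconnect to the already-running Ray cluster so worker processes
    # and the object store are shared between iterations.
    ray.init(address="auto", ignore_reinit_error=True)

    # build_config() stands in for get_trainer_config() + register_env()
    # from the repro script above; it must run in this process as well.
    config = build_config()
    trainer = PPOTrainer(config=config)
    if checkpoint:
        trainer.restore(checkpoint)
    result = trainer.train()
    new_checkpoint = trainer.save()
    trainer.stop()  # important: release trainer/worker state before exiting

    with open(result_path, "wb") as f:
        pickle.dump({"iteration": result["training_iteration"],
                     "checkpoint": new_checkpoint}, f)


def outer_loop(n_iterations):
    # spawn (not fork): forking after TensorFlow has been imported is fragile
    ctx = multiprocessing.get_context("spawn")
    checkpoint = None
    for _ in range(n_iterations):
        fd, result_path = tempfile.mkstemp(suffix=".pkl")
        os.close(fd)
        proc = ctx.Process(target=train_one_iteration,
                           args=(checkpoint, result_path))
        proc.start()
        proc.join()
        with open(result_path, "rb") as f:
            checkpoint = pickle.load(f)["checkpoint"]
        os.remove(result_path)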

@sergeivolodin
Author

@richardliaw any updates? Do you want me to try different tf versions? Thank you.

@richardliaw
Contributor

Hmm, could you try the latest Ray wheels (latest snapshot of master) to see if this was fixed?

@sergeivolodin
Author

@richardliaw same thing

[screenshots]

@sergeivolodin
Author

sergeivolodin commented Nov 11, 2020

@richardliaw is there a way to attach a Python debugger to a Ray worker, preferably from an IDE like PyCharm? I could take a look at where and why the memory is being used.

@richardliaw
Contributor

There's now a Ray PDB tool: https://docs.ray.io/en/master/ray-debugging.html?highlight=debugger
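
Roughly, based on those docs (a minimal sketch, not specific to this repro): drop a breakpoint() into the code running on the worker, then attach with ray debug from a shell on the same node:

import ray

ray.init()

@ray.remote
def leaky_task(x):
    breakpoint()  # execution pauses here until `ray debug` attaches
    return x * 2

ray.get(leaky_task.remote(1))

# in a separate shell on the same machine:
#   $ ray debug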

@jackbart94

Is there any update on this issue? I have been dealing with the same.

@azzeddineCH

Hi @sergeivolodin, how are you plotting these graphs, please? I suspect I'm having a similar issue.

@sergeivolodin
Author

sergeivolodin commented Jan 20, 2021

@azzeddineCH using this custom script:

https://github.com/HumanCompatibleAI/better-adversarial-defenses/tree/master/other/memory_profile

  1. (assuming Python 3) $ pip install psutil numpy matplotlib humanize inquirer
  2. $ python mem_profile.py -- records memory data for all processes belonging to your user and saves it to mem_out_{username}_{time_start}.txt (the filename is printed)
  3. Run $ python mem_analyze.py. It has several options:
  • You can either specify the file with --input FILENAME, or the script will ask you which file to open (navigate up/down, press Enter to select)
  • By default, the script reports usage for ['ray::ImplicitFunc', 'ray::RolloutWorker'], but you can change that either by
    • adding the --customize flag and selecting the processes interactively (navigate up/down, Space to select, Enter to finish selecting)
    • adding program names with the --track PROCESS_NAME flag
  • To subtract the initial memory values from all chart values, add --subtract
  • --max_lines reads only that many lines from the file (useful if the profiling script ran for longer than the monitored program itself)

Good luck!

@ericl ericl added this to the RLlib Bugs milestone Mar 11, 2021
@ericl ericl removed the rllib label Mar 11, 2021
@Bam4d
Contributor

Bam4d commented May 6, 2021

I think I'm running into this too. It seems like there are "jumps" in memory at regular intervals. The leak is very slow for me, and I'm pretty sure I've ruled out everything else it could be other than a bug in the multi-agent implementation in RLlib.
[screenshot]

This is with the latest wheels, Python 3.8, PyTorch 1.8.0.

@Bam4d
Contributor

Bam4d commented May 12, 2021

I've done some more digging using tracemalloc, and I don't think the bug is actually in the Python code, as there are no large, consistent allocations of Python objects. That leaves a potential issue with PyTorch or some C++ library, or something to do with how Ray handles worker memory.

The bug also does not seem to happen consistently for me across machines. More specifically, on servers where cgroups are enabled I get the memory leak, but on my local machine I do not. @sven1977 any ideas?
I added a bit more information here: https://discuss.ray.io/t/help-debugging-a-memory-leak-in-rllib/2100
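
The tracemalloc check was along these lines (a minimal sketch, not the exact code):

import tracemalloc

tracemalloc.start(25)  # keep up to 25 stack frames per allocation
baseline = tracemalloc.take_snapshot()

# ... run a number of training iterations here ...

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.compare_to(baseline, "lineno")[:20]:
    # largest growth in Python-level allocations since the baseline
    print(stat)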
