ReverbAddTrajectoryObserver gives "The number of pending items is alarmingly high" error #853

Open

adiaconu11 opened this issue Aug 15, 2023 · 0 comments
Hello everyone,

I am struggling to fix an issue with the ReverbAddTrajectoryObserver. I am constantly getting this error:
    [reverb/cc/trajectory_writer.cc:655] The number of pending items is alarmingly high, did you forget to call Flush? 130 items are waiting to be sent and 8 items have been sent to the server but haven't been confirmed yet. It is important to call Flush regularly as large numbers of pending items can result in OOM crashes on both client and server.

The code I am using is very similar to the Schulman PPO example in tf_agents/examples/ppo/schulman17/train_eval_lib.py:

    replay_buffer_capacity = 1000000
    replay_seq_len = 100
    batch_size = 128
    reverb_server = reverb.Server(
        [
            reverb.Table( # Replay buffer for training experience
                name="training_table",
                sampler=reverb.selectors.Uniform(),
                remover=reverb.selectors.Fifo(),
                rate_limiter=reverb.rate_limiters.MinSize(1),
                max_times_sampled=1,
                max_size=replay_buffer_capacity,
            ),
            reverb.Table(
                name="normalization_table",
                sampler=reverb.selectors.Uniform(),
                remover=reverb.selectors.Fifo(),
                rate_limiter=reverb.rate_limiters.MinSize(1),
                max_times_sampled=1,
                max_size=replay_buffer_capacity,
            )
        ]
    )

    reverb_replay_train = reverb_replay_buffer.ReverbReplayBuffer(
        tf_agent.collect_data_spec,
        sequence_length=replay_seq_len,
        table_name='training_table',
        server_address='localhost:{}'.format(reverb_server.port),
        # The only collected sequence is used to populate the batches.
        dataset_buffer_size=10*batch_size,
        max_cycle_length=1,
        rate_limiter_timeout_ms=1000,
        num_workers_per_iterator=6,
    )
    reverb_replay_normalization = reverb_replay_buffer.ReverbReplayBuffer(
        tf_agent.collect_data_spec,
        sequence_length=replay_seq_len,
        table_name='normalization_table',
        server_address='localhost:{}'.format(reverb_server.port),
        # The only collected sequence is used to populate the batches.
        dataset_buffer_size=10*batch_size,
        max_cycle_length=1,
        rate_limiter_timeout_ms=1000,
        num_workers_per_iterator=6,
    )

    rb_observer = reverb_utils.ReverbAddTrajectoryObserver(
        reverb_replay_train.py_client,
        ['training_table', 'normalization_table'],
        sequence_length=replay_seq_len,
    )
    # [Omitted code]

    collect_actor = actor.Actor(
        train_py_env,
        collect_policy,
        train_step,
        steps_per_run=2*batch_size*replay_seq_len,
        observers=[rb_observer],
        metrics=actor.collect_metrics(buffer_size=10),
    )

From my understanding, the problem is that the Reverb server can't keep up with the rate at which items are added to it, so I need to flush rb_observer regularly. I added self.flush() inside the __call__ method of ReverbAddTrajectoryObserver, but that makes the code about 3x slower. Is there another way of fixing this? I am running everything on a large server, so memory and CPU cores are not an issue; I just hate that my error logs grow to gigabytes because of the message above.

Also, the number of pending items is always below 250, so I tried bumping the warning threshold, but it looks like I can't: trajectory_writer.cc is C++ source that gets compiled when the library is installed.
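
For concreteness, the less aggressive variant I could imagine trying looks roughly like the sketch below. PeriodicFlushObserver and FLUSH_EVERY are names I made up for illustration; they are not part of tf_agents:

    # Sketch only: flush the wrapped observer every FLUSH_EVERY trajectories
    # instead of on every call, so pending items stay bounded without paying
    # the per-call flush cost. These names are hypothetical, not tf_agents API.
    FLUSH_EVERY = 64  # guess; tune so pending items stay well below the warning

    class PeriodicFlushObserver(object):

        def __init__(self, observer, flush_every=FLUSH_EVERY):
            self._observer = observer
            self._flush_every = flush_every
            self._num_calls = 0

        def __call__(self, trajectory):
            # Forward to the wrapped ReverbAddTrajectoryObserver.
            self._observer(trajectory)
            self._num_calls += 1
            if self._num_calls % self._flush_every == 0:
                # Same flush() call I currently make on every __call__.
                self._observer.flush()

        def flush(self):
            self._observer.flush()

    # and then in the actor:
    #     observers=[PeriodicFlushObserver(rb_observer)]

Flushing once after each collect_actor.run() might be another, even simpler option, but I am not sure which approach the library expects.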

Thanks!
