Potential memory issue with tf_py_environment #8

Closed
ChengshuLi opened this issue Feb 13, 2019 · 7 comments

@ChengshuLi
ChengshuLi commented Feb 13, 2019

Hi,

First of all, my environment is the following:
TensorFlow version: 1.13.0-dev20190205 (pip install tf-nightly-gpu)
tf-agents version: 0.2.0.dev20190123 (pip install tf-agents-nightly)
CUDA version: 10.0
cuDNN version: 7.4.1
Ubuntu version: 16.04

When I wrapped my custom Python environment using tf_py_environment, it seemed to consume more and more CPU memory over time, until memory ran out and the program hung. The problem is particularly evident when the observation is large (say, an RGB image or a large vector).

Here is a toy example:

import tensorflow as tf
from tf_agents.environments import tf_py_environment, py_environment
from tf_agents.specs import array_spec
from tf_agents.environments import time_step as ts
import numpy as np

img_size = 5000

class MyEnv(py_environment.Base):
    def __init__(self):
        self._action_spec = array_spec.BoundedArraySpec(shape=(), dtype=np.float32)
        self._observation_spec = array_spec.BoundedArraySpec(shape=(img_size, img_size, 3), dtype=np.float32)

    def action_spec(self):
        return self._action_spec

    def observation_spec(self):
        return self._observation_spec

    def reset(self):
        return ts.restart(np.zeros(shape=(img_size, img_size, 3), dtype=np.float32))

    def step(self, action):
        return ts.transition(np.zeros(shape=(img_size, img_size, 3), dtype=np.float32), reward=0.0, discount=1.0)

tf_py_env = MyEnv()
tf_env = tf_py_environment.TFPyEnvironment(tf_py_env)
i = 0
while True:
    if i % 10000 == 0:
        print(i)
        tf_env.reset()
    action = tf.constant([0.0])
    time_step = tf_env.step(action)
    i += 1

After a few minutes of running, it had used up almost all available memory and the program hung. The last value printed was 850000.

I have also run tf_agents/agents/dqn/examples/train_eval_atari.py for a while, and it shows the same symptom.
Memory usage fluctuated between 40% and 90%. Due to time and compute limits, I didn't get the chance to run it until it converged, crashed, or hung.

In both cases, running the program makes my machine pretty slow. Is this expected?

I am very new to tf-agents, so I suspect I did something wrong (maybe I am supposed to free memory somewhere in my code?). I would really appreciate it if someone could point me in the right direction. Thanks!

Eric
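
One way to check whether a loop body keeps accumulating memory is to compare traced allocations before and after a fixed number of iterations. The sketch below uses Python's standard `tracemalloc` module; `traced_growth` and the two sample loop bodies are hypothetical names introduced for illustration. Note that `tracemalloc` only sees allocations made through Python's allocator, so memory held by native code may not show up.

```python
import tracemalloc


def traced_growth(loop_body, iterations):
    """Return the change in traced Python memory (bytes) over `iterations` calls.

    `loop_body` is any zero-argument callable standing in for one loop iteration.
    """
    tracemalloc.start()
    try:
        before, _peak = tracemalloc.get_traced_memory()
        for _ in range(iterations):
            loop_body()
        after, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before


# Hypothetical loop bodies: one keeps every buffer alive, one discards it.
retained = []


def accumulating_body():
    retained.append(bytearray(1024))


def discarding_body():
    bytearray(1024)


if __name__ == "__main__":
    print("accumulating:", traced_growth(accumulating_body, 1000), "bytes")
    print("discarding:  ", traced_growth(discarding_body, 1000), "bytes")
```

If the first number keeps scaling with the iteration count while the second stays near zero, the loop body is holding on to something it creates on every pass.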

@oars
Contributor

oars commented Feb 13, 2019

Hi Eric,

Note that you are using a tf_environment. The way your code is structured, each loop iteration generates new TensorFlow ops but never evaluates them, which is what causes the increase in memory usage.

You'll want to change the latter part of your code to:

tf_py_env = MyEnv()
tf_env = tf_py_environment.TFPyEnvironment(tf_py_env)

# Build the ops once, outside the loop, so the graph stops growing.
action = tf.constant([0.0])
reset_op = tf_env.reset()
step_op = tf_env.step(action)

i = 0

with tf.Session() as sess:
    while True:
        if i % 10000 == 0:
            print(i)
            time_step = sess.run(reset_op)
        # Evaluate the same op on every iteration instead of creating a new one.
        time_step = sess.run(step_op)
        i += 1
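
The difference between creating ops inside the loop and creating them once can be illustrated with a toy model. This is not TensorFlow code: `ToyGraph`, `add_op`, and `run` are hypothetical stand-ins, and the only assumption is that a graph keeps every node that was ever added to it.

```python
class ToyGraph:
    """Hypothetical stand-in for a deferred-execution graph (not a TensorFlow API)."""

    def __init__(self):
        # Every node ever added stays here until the graph itself is discarded.
        self.ops = []

    def add_op(self, fn):
        """Record a callable as a graph node and return its handle (an index)."""
        self.ops.append(fn)
        return len(self.ops) - 1

    def run(self, handle):
        """Evaluate one previously recorded node; adds nothing to the graph."""
        return self.ops[handle]()


def build_in_loop(graph, iterations):
    # Anti-pattern: a new node is recorded on every iteration and never run.
    for _ in range(iterations):
        graph.add_op(lambda: 0.0)
    return len(graph.ops)


def build_once_then_run(graph, iterations):
    # Record a single node up front, then evaluate it repeatedly.
    step_op = graph.add_op(lambda: 0.0)
    for _ in range(iterations):
        graph.run(step_op)
    return len(graph.ops)


if __name__ == "__main__":
    print(build_in_loop(ToyGraph(), 1000))        # 1000
    print(build_once_then_run(ToyGraph(), 1000))  # 1
```

In the first function the node count grows linearly with the number of iterations; in the second it stays constant no matter how long the loop runs.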

oars closed this as completed Feb 13, 2019
@ChengshuLi
Author

@oars

Thank you so much for your help! It completely makes sense.

I think I was just following agents/tf_agents/colabs/environments_tutorial.ipynb without thinking too much about it and also overlooked tf.enable_eager_execution().

Thanks again.

@ChengshuLi
Author

ChengshuLi commented Feb 13, 2019

@oars After running your code, I actually got an error:

RuntimeError: The Session graph is empty.  Add operations to the graph before calling run().

Do you know how I can fix this?

Also, as I mentioned above, when I ran tf_agents/agents/dqn/examples/train_eval_atari.py with no modification, the memory usage still fluctuated between 40% and 80+%, and my computer also became slower. Is that to be expected?

Thanks!

oars self-assigned this Feb 19, 2019
@oars
Contributor

oars commented Feb 19, 2019

Can you try updating tf-agents and running it again? I can't reproduce your error. For reference, I just re-ran with this code:

import tensorflow as tf
from tf_agents.environments import tf_py_environment, py_environment
from tf_agents.specs import array_spec
from tf_agents.environments import time_step as ts
import numpy as np

img_size = 5000

class MyEnv(py_environment.Base):
    def __init__(self):
        self._action_spec = array_spec.BoundedArraySpec(shape=(), dtype=np.float32)
        self._observation_spec = array_spec.BoundedArraySpec(shape=(img_size, img_size, 3), dtype=np.float32)

    def action_spec(self):
        return self._action_spec

    def observation_spec(self):
        return self._observation_spec

    def _reset(self):
        return ts.restart(np.zeros(shape=(img_size, img_size, 3), dtype=np.float32))

    def _step(self, action):
        return ts.transition(np.zeros(shape=(img_size, img_size, 3), dtype=np.float32), reward=0.0, discount=1.0)

tf_py_env = MyEnv()
tf_env = tf_py_environment.TFPyEnvironment(tf_py_env)
action = tf.constant([0.0])
reset_op = tf_env.reset()
step_op = tf_env.step(action)

i = 0

with tf.Session() as sess:
    while True:
        if i % 100 == 0:
            print(i)
            time_step = sess.run(reset_op)
        time_step = sess.run(step_op)
        i += 1

Regarding the memory fluctuations, I wouldn't expect that to happen either. Note that your image is fairly large (~280MB as raw float32), so if there are a couple of internal copies of it, memory usage will be large.
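
The size estimate can be checked with a little arithmetic: a (5000, 5000, 3) float32 array holds 75 million 4-byte elements. The helper name `observation_bytes` below is hypothetical, introduced only for this sketch.

```python
import math


def observation_bytes(shape, bytes_per_element=4):
    """Raw size in bytes of a dense array; float32 uses 4 bytes per element."""
    return math.prod(shape) * bytes_per_element


if __name__ == "__main__":
    size = observation_bytes((5000, 5000, 3))
    print(size)                          # 300000000
    print(f"{size / 2**20:.1f} MiB")     # 286.1 MiB
```

That is 300 MB (about 286 MiB) per observation, in line with the rough figure quoted above, so even two or three live copies approach a gigabyte.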

@ChengshuLi
Author

It works!

I had called tf.enable_eager_execution() before, and that is why it didn't work.
Sorry, I should have spent more time understanding TF eager mode; this error should have been obvious. Thanks a lot.

@liujuncn

@oars How do I deal with this in TF 2.0, without tf.Session?

@oars
Contributor

oars commented Nov 18, 2019

How to deal with what? There are examples using environments in 2.0. Please look at the colabs.
