
ResourceVariable save will lead to OOM in distributed mode #20914

Closed
jackonan opened this issue Jul 18, 2018 · 8 comments
jackonan commented Jul 18, 2018

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): 1.4
  • Python version: 2.7.5
  • Bazel version (if compiling from source): 0.9
  • GCC/Compiler version (if compiling from source): 4.8.5
  • CUDA/cuDNN version: 7.5
  • GPU model and memory: None
  • Exact command to reproduce: save a ResourceVariable in distributed mode

Describe the problem

My model is more than 200 GB, so I run it in distributed mode on CPUs with 1000 workers and 100 parameter servers (ps). All of the model's variables are ResourceVariables partitioned by size, and they are placed by the default tf.train.replica_device_setter. Training is driven by a MonitoredTrainingSession.
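
For context, a rough sketch of this kind of setup (the cluster spec, variable shape, optimizer, and checkpoint directory below are illustrative placeholders, not the actual model; a real run also needs the matching tf.train.Server processes):

    import tensorflow as tf

    # Hypothetical cluster; the real job has 1000 workers and 100 ps tasks.
    cluster = tf.train.ClusterSpec({
        "ps": ["ps0:2222", "ps1:2222"],
        "worker": ["worker0:2222", "worker1:2222"],
    })

    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
      # Large variable partitioned by size across the ps tasks, created as a
      # ResourceVariable via use_resource=True.
      weights = tf.get_variable(
          "weights",
          shape=[10000000, 128],
          partitioner=tf.variable_axis_size_partitioner(max_shard_bytes=256 << 20),
          use_resource=True)
      loss = tf.reduce_sum(weights)
      train_op = tf.train.AdagradOptimizer(0.01).minimize(loss)

    # MonitoredTrainingSession periodically saves checkpoints to checkpoint_dir;
    # the OOM on ps0 shows up when this saving kicks in.
    with tf.train.MonitoredTrainingSession(
        master="grpc://worker0:2222",
        is_chief=True,
        checkpoint_dir="/tmp/model") as sess:
      sess.run(train_op)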

The problem appears when the model starts saving a checkpoint: the memory of ps0 rises rapidly to 200 GB and then the process OOMs. With device placement logging enabled, I can see that every variable is copied (via an Identity op) to ps0 when the save op runs.
In ResourceVariableSaveable, line 182 resets the device, which causes all of the save ops to be placed on ps0. If I remove this line and re-run, saving works correctly (a sketch of the resulting closure follows the snippet below).

 168   class ResourceVariableSaveable(SaveableObject):
 169     """SaveableObject implementation that handles ResourceVariables."""
 170
 171     def __init__(self, var, slice_spec, name):
 172       self._var_device = var.device
 173       if isinstance(var, ops.Tensor):
 174         self.handle_op = var.op.inputs[0]
 175         tensor = var
 176       elif isinstance(var, resource_variable_ops.ResourceVariable):
 177
 178         def _read_variable_closure(v):
 179           def f():
 180             with ops.device(v.device):
 181               x = v.read_value()
 182             with ops.device("/device:CPU:0"):
 183               return array_ops.identity(x)
 184           return f
 185
 186         self.handle_op = var.handle
 187         tensor = _read_variable_closure(var)
 188       else:
 189         raise ValueError(
 190             "Saveable is neither a resource variable nor a read operation."
 191             " Got: %s" % repr(var))
 192       spec = BaseSaverBuilder.SaveSpec(tensor, slice_spec, name)
 193       super(BaseSaverBuilder.ResourceVariableSaveable, self).__init__(
 194           var, [spec], name)
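
For reference, one way to apply the workaround described above (dropping the forced "/device:CPU:0" scope so the copy inherits the variable's own device context) looks roughly like this; this is a local patch sketch, not the upstream fix:

        def _read_variable_closure(v):
          def f():
            with ops.device(v.device):
              x = v.read_value()
              # No explicit "/device:CPU:0" override: the identity stays in the
              # variable's device context, so the copy (and the downstream
              # SaveV2) is placed on the ps task that owns this partition.
              return array_ops.identity(x)
          return f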

Source code / logs

With device placement logging enabled, you can see that all the save ops are placed on ps0. There are many log lines like this one:

 [2018-07-18 16:02:41.883522] [INFO] [31791] [tensorflow/core/common_runtime/placer.cc:698] Ignoring device specification /device:CPU:0 for node 'save_2/AssignVariableOp_176' because the input edge from 'Optimize/OptimizeLoss/CTR-PositionNetwork/position_hiddenlayer_1/weights/part_7/AdagradDecay_1'       is a reference connection and already has a device field set to /job:ps/task:16

Full placement log attached: log.txt

@michaelisard

Thanks for tracking this down. @agarwal-ashish, it looks as if you put in this device assignment; would you take a look? Thanks.

candyzone commented Jul 19, 2018

@agarwal-ashish
When I use a partitioned ResourceVariable, it goes to line 176. At line 181, 'x' is already an identity placed with "v.device"; line 183 then changes the placement with another tf.identity. That pins the Identity op to the bare "/device:CPU:0": after the graph is built in Python, the C++ placer gathers the candidate devices for the Identity op and finally selects the default device "/job:ps/replica:0/task:0/device:CPU:0" (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/placer.cc#L861). The same thing happens to the SaveV2 op, which results in op placement imbalance.
Only job:ps/task:0 gets a SaveV2 op, even though I use tf.train.Saver(sharded=True) (see the sketch below).
tf.train.Saver(sharded=True) works well with Variable and with non-partitioned ResourceVariable.
Is this by design?
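
For reference, the sharded-saver usage in question looks roughly like this (a single-process sketch with made-up sizes; in a real cluster, sharded=True is expected to emit one SaveV2 op per device that holds variables):

    import tensorflow as tf

    # Hypothetical partitioned resource variable (shape and shard count are
    # illustrative).
    weights = tf.get_variable(
        "weights",
        shape=[1024, 16],
        partitioner=tf.fixed_size_partitioner(num_shards=4),
        use_resource=True)

    # sharded=True asks the saver to write one checkpoint shard per device,
    # so each ps task should save its own slices.
    saver = tf.train.Saver(sharded=True)

    with tf.Session() as sess:
      sess.run(tf.global_variables_initializer())
      saver.save(sess, "/tmp/sharded_model.ckpt")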

@agarwal-ashish

The intention here is to handle checkpointing of variables placed on GPU when running with an explicit device placement policy in eager mode. I am submitting a fix that copies the variable to the CPU on the same machine instead of overriding the device to "/device:CPU:0", which might place it on a different job.
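
A minimal sketch of that idea (not the actual commit): rewrite only the device type and index of the variable's device so the copy lands on the CPU of the same job/replica/task. pydev here is the internal tensorflow.python.framework.device module, and _same_machine_cpu is a hypothetical helper name:

    from tensorflow.python.framework import device as pydev

    def _same_machine_cpu(device_string):
      # e.g. "/job:ps/replica:0/task:16/device:GPU:0"
      #   -> "/job:ps/replica:0/task:16/device:CPU:0"
      spec = pydev.DeviceSpec.from_string(device_string)
      spec.device_type = "CPU"
      spec.device_index = 0
      return spec.to_string()

    def _read_variable_closure(v):
      def f():
        with ops.device(v.device):
          x = v.read_value()
        # Copy to the CPU of the machine that owns the variable, rather than
        # an unqualified "/device:CPU:0" that the placer may resolve to ps0.
        with ops.device(_same_machine_cpu(v.device)):
          return array_ops.identity(x)
      return f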

tensorflow-copybara pushed a commit that referenced this issue on Jul 19, 2018: "…cate the copy of the variable on the same machine. Addresses Issue #20914." (PiperOrigin-RevId: 205317119)

@jackonan

@agarwal-ashish thanks for the reply. I made a change in my local repo; it works well and looks simple. Here is the PR: #20985. Maybe you can check whether it is helpful.

@agarwal-ashish

Did you test whether the code after commit 8f130ff works for you?


jackonan commented Jul 23, 2018

@agarwal-ashish No, I have only tested my own fix, the indentation one. I will test it later.


jackonan commented Jul 23, 2018

@agarwal-ashish Both fixes work correctly. The only difference is that you parse the device info manually, while I rely on the surrounding device context.

@agarwal-ashish

Thanks for the fix!
