ResourceVariable save will lead to OOM in distributed mode #20914

jackonan · 2018-07-18T09:28:59Z

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux
TensorFlow installed from (source or binary): source
TensorFlow version (use command below): 1.4
Python version: 2.7.5
Bazel version (if compiling from source): 0.9
GCC/Compiler version (if compiling from source): 4.8.5
CUDA/cuDNN version: 7.5
GPU model and memory: None
Exact command to reproduce: save ResourceVariable in a distributed mode

Describe the problem

My model is more than 200GB, so I run it in distributed mode on CPU with 1000 workers and 100 ps. All the variables of my model are ResourceVariables partitioned by size, and they are placed by the default tf.train.replica_device_setter. Training is driven by a MonitoredTrainingSession.

The problem happens when the model starts saving. The memory of ps0 rises rapidly to 200GB and then it OOMs. I turned on device placement logging and found that all variables are copied to ps0 through an Identity op when save_op runs.
In ResourceVariableSaveable, line 182 resets the device, which causes all save ops to be placed on ps0. I removed this line and re-ran, and it works correctly.

 168   class ResourceVariableSaveable(SaveableObject):
 169     """SaveableObject implementation that handles ResourceVariables."""
 170
 171     def __init__(self, var, slice_spec, name):
 172       self._var_device = var.device
 173       if isinstance(var, ops.Tensor):
 174         self.handle_op = var.op.inputs[0]
 175         tensor = var
 176       elif isinstance(var, resource_variable_ops.ResourceVariable):
 177
 178         def _read_variable_closure(v):
 179           def f():
 180             with ops.device(v.device):
 181               x = v.read_value()
 182             with ops.device("/device:CPU:0"):
 183               return array_ops.identity(x)
 184           return f
 185
 186         self.handle_op = var.handle
 187         tensor = _read_variable_closure(var)
 188       else:
 189         raise ValueError(
 190             "Saveable is neither a resource variable nor a read operation."
 191             " Got: %s" % repr(var))
 192       spec = BaseSaverBuilder.SaveSpec(tensor, slice_spec, name)
 193       super(BaseSaverBuilder.ResourceVariableSaveable, self).__init__(
 194           var, [spec], name)
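
For a rough sense of scale, the figures in this report can be put into a small calculation. This is only a back-of-the-envelope sketch: it assumes the variables are spread evenly across the ps tasks, and the function name `ps0_tensor_footprint_gb` is made up for illustration.

```python
# Rough estimate only. Figures are taken from the report; the even split
# across ps tasks is an assumption.
MODEL_SIZE_GB = 200.0  # lower bound: the model is "more than 200GB"
NUM_PS = 100


def ps0_tensor_footprint_gb(model_gb, num_ps, copies_funnel_to_ps0):
    """Approximate GB of variable data ps0 must hold while saving."""
    own_share = model_gb / num_ps
    if copies_funnel_to_ps0:
        # ps0 keeps its own shard and receives a copy of every other shard.
        return own_share + (model_gb - own_share)
    # Each ps only handles the shard it already owns.
    return own_share


balanced = ps0_tensor_footprint_gb(MODEL_SIZE_GB, NUM_PS, False)
funneled = ps0_tensor_footprint_gb(MODEL_SIZE_GB, NUM_PS, True)
print("balanced: %.1f GB, funneled: %.1f GB" % (balanced, funneled))
```

Under these assumptions ps0 goes from handling about 1/100 of the model to handling all of it, which is consistent with the memory growth described above.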

Source code / logs

With device placement logging turned on, all save_ops show up as placed on ps0. There are many log lines like this one:

 [2018-07-18 16:02:41.883522] [INFO] [31791] [tensorflow/core/common_runtime/placer.cc:698] Ignoring device specification /device:CPU:0 for node 'save_2/AssignVariableOp_176' because the input edge from 'Optimize/OptimizeLoss/CTR-PositionNetwork/position_hiddenlayer_1/weights/part_7/AdagradDecay_1'       is a reference connection and already has a device field set to /job:ps/task:16

log.txt


michaelisard · 2018-07-18T19:40:57Z

Thanks for tracking this down. @agarwal-ashish, it looks as if you put in this device assignment. Would you take a look? Thanks.

candyzone · 2018-07-19T02:46:20Z

@agarwal-ashish
When I use a partitioned ResourceVariable, execution goes to Line176. At Line181, 'x' is already a tf.identity placed on "v.device"; Line183 then changes the device placement with another tf.identity. This gives the Identity op the placement "/device:CPU:0". After the graph is built in Python, TF collects the candidate devices for the Identity op in C++ and finally selects the default device "/job:ps/replica:0/task:0/device:CPU:0" (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/placer.cc#L861). The same process happens for the SaveV2 op. The result is an imbalanced op placement.
Only job:ps/task:0 has a SaveV2 op, even though I use the API tf.train.Saver(sharded=True).
tf.train.Saver(sharded=True) works well with Variable and with non-partitioned ResourceVariable.
Is this by design?
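
The effect described above can be illustrated with a toy model. This is not the actual placer algorithm (which also handles colocation groups, soft placement, and more); it only assumes that a partial device spec matches every task's CPU and that the first matching device wins. The helpers `parse_device`, `matches`, and `toy_place` are hypothetical names for this sketch.

```python
def parse_device(spec):
    """Parse '/job:ps/task:3/device:CPU:0' into a dict of the fields present."""
    fields = {}
    for part in spec.strip("/").split("/"):
        if not part:
            continue
        key, _, value = part.partition(":")
        fields[key] = value
    return fields


def matches(partial_spec, full_device):
    """True if every field named in partial_spec agrees with full_device."""
    want = parse_device(partial_spec)
    have = parse_device(full_device)
    return all(have.get(key) == value for key, value in want.items())


def toy_place(requested, cluster_devices):
    """Toy rule (an assumption): pick the first device compatible with the request."""
    for dev in cluster_devices:
        if matches(requested, dev):
            return dev
    raise ValueError("no device matches %r" % requested)


cluster = ["/job:ps/replica:0/task:%d/device:CPU:0" % i for i in range(4)]

# Every op that only asks for "/device:CPU:0" ends up on the same task.
placements = [toy_place("/device:CPU:0", cluster) for _ in range(8)]

# An op that also names its job and task stays where its variable lives.
pinned = toy_place("/job:ps/task:2/device:CPU:0", cluster)
```

In this model all eight under-specified ops collapse onto task 0, while the fully specified request stays on task 2.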

agarwal-ashish · 2018-07-19T21:20:06Z

The intention here is to handle checkpointing of Variables placed on GPU when running with an explicit device placement policy in Eager mode. I am submitting a fix that copies the variable to the CPU on the same machine instead of overriding the device to "/device:CPU:0", which might place it on a different job.
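
A minimal sketch of the idea of "CPU on the same machine", assuming device strings of the form "/job:.../replica:.../task:.../device:TYPE:N". It is not the code from the actual fix; the helper name `host_cpu_for` is hypothetical, and the parsing is done by hand rather than with TensorFlow's own device-spec utilities.

```python
def host_cpu_for(device_string):
    """Keep the job/replica/task fields and force the device to CPU:0."""
    kept = []
    for part in device_string.strip("/").split("/"):
        if not part:
            continue
        # Drop any existing device field, including the short "/gpu:0" form.
        if part.lower().startswith(("device:", "gpu:", "cpu:")):
            continue
        kept.append(part)
    kept.append("device:CPU:0")
    return "/" + "/".join(kept)


gpu_var = host_cpu_for("/job:worker/replica:0/task:3/device:GPU:1")
ps_var = host_cpu_for("/job:ps/task:16")
local_var = host_cpu_for("")
```

The point of the sketch is that the job and task survive the rewrite, so the copy made for checkpointing stays on the machine that already holds the variable.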


jackonan · 2018-07-20T03:08:58Z

@agarwal-ashish Thanks for the reply. I made a change in my local repo. It works well and looks simple. Here is the PR: #20985. Maybe you can check whether it is helpful.

agarwal-ashish · 2018-07-20T18:59:28Z

Did you test whether the code after commit 8f130ff works for you?

jackonan · 2018-07-23T01:10:45Z

@agarwal-ashish No, I have only tested my fix, the indentation one. I will test yours later.

jackonan · 2018-07-23T02:35:25Z

@agarwal-ashish Both fixes work correctly. The only difference is that you parse the device info manually and I use the context.
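
As I read the thread, "use the context" means nesting the "/device:CPU:0" scope inside the variable's device scope so that the two specs are merged, instead of opening it as a separate sibling scope. The sketch below is a toy model of that merging behavior in plain Python, not TensorFlow's implementation; `device` and `current_device` are stand-ins written for this example.

```python
import contextlib

_scope_stack = []  # toy stand-in for a graph's device scope stack


@contextlib.contextmanager
def device(spec):
    """Push a (possibly partial) device spec for the duration of the block."""
    _scope_stack.append(spec)
    try:
        yield
    finally:
        _scope_stack.pop()


def _fields(spec):
    """Split '/job:ps/task:16' into {'job': 'ps', 'task': '16'}."""
    out = {}
    for part in spec.strip("/").split("/"):
        if part:
            key, _, value = part.partition(":")
            out[key] = value
    return out


def current_device():
    """Merge active scopes: inner fields override, missing fields are inherited."""
    merged = {}
    for spec in _scope_stack:  # outermost first
        merged.update(_fields(spec))
    order = ["job", "replica", "task", "device"]
    return "".join("/%s:%s" % (k, merged[k]) for k in order if k in merged)


var_device = "/job:ps/task:16"

# Sibling scopes, as in the listing above: the CPU:0 scope knows nothing
# about the variable's job or task.
with device(var_device):
    read_dev = current_device()
with device("/device:CPU:0"):
    sibling_copy_dev = current_device()

# Nested scopes: the copy inherits the job and task from the outer scope.
with device(var_device):
    with device("/device:CPU:0"):
        nested_copy_dev = current_device()
```

In this model the sibling form yields a bare "/device:CPU:0", while the nested form yields a spec that still names the variable's own ps task, which is the same result the manual parsing approach aims for.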

agarwal-ashish · 2018-07-23T21:31:53Z

Thanks for the fix!

tensorflowbutler assigned michaelisard Jul 18, 2018

michaelisard assigned agarwal-ashish and unassigned michaelisard Jul 18, 2018

tensorflow-copybara pushed a commit that referenced this issue Jul 19, 2018

8f130ff: Fix ResourceVariable placement during checkpointing to correctly colocate the copy of the variable on the same machine. Addresses Issue #20914. PiperOrigin-RevId: 205317119

agarwal-ashish closed this as completed Jul 23, 2018
