Problem about distributed training with XLA compiling. #45940

@sseung0703

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    • custom layer and custom training step
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    • I have tested on Windows 10, Ubuntu 16.04, and Ubuntu 18.04.
  • TensorFlow installed from (source or binary):
    • both
  • TensorFlow version (use command below):
    • I have tried the distributed binaries of TF 2.4 and 2.5, and TF 2.4 built from source
  • Python version:
    • 3.7
  • Bazel version (if compiling from source):
    • 3.5.0
  • GCC/Compiler version (if compiling from source):
    • 7.5
  • CUDA/cuDNN version:
    • 10.1 and 11.0
  • GPU model and memory:
    • GTX 1080 Ti x4

Describe the current behavior
When I train my model on multiple GPUs with XLA compilation enabled, the following error occurs.

Training starts
Traceback (most recent call last):
  File "FFP_/train_w_pruning.py", line 76, in <module>
    train_step(*data)
  File "/home/cvip/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 787, in __call__
    result = self._call(*args, **kwds)
  File "/home/cvip/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 854, in _call
    filtered_flat_args, self._concrete_stateful_fn.captured_inputs)  # pylint: disable=protected-access
  File "/home/cvip/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1920, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/cvip/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 561, in call
    ctx=ctx)
  File "/home/cvip/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Trying to access resource ResNet/conv/kernel/replica_1_879 located in device /job:localhost/replica:0/task:0/device:GPU:0 [Op:__inference_train_step_dist_88943]
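
For anyone reading the last line of the traceback, here is a minimal sketch that splits it into its parts: the resource name, the fields of the device string, and the op name. The regular expression and the `parse_resource_error` helper are hypothetical, written only for this one message; they are not part of TensorFlow. Note that the `/replica:0` field inside the device string is part of the device name and is separate from the `replica_1` fragment in the resource name, which presumably marks the second replica's copy of the variable under a mirrored distribution strategy (an assumption, not confirmed by the trace).

```python
import re

# Error text copied from the traceback above.
ERROR = (
    "Trying to access resource ResNet/conv/kernel/replica_1_879 located in "
    "device /job:localhost/replica:0/task:0/device:GPU:0 "
    "[Op:__inference_train_step_dist_88943]"
)

# Hypothetical pattern written for this one message; not a TensorFlow API.
PATTERN = re.compile(
    r"resource (?P<resource>\S+) located in device "
    r"/job:(?P<job>[^/]+)/replica:(?P<replica>\d+)/task:(?P<task>\d+)"
    r"/device:(?P<device_type>[A-Z]+):(?P<device_index>\d+)"
    r"(?: \[Op:(?P<op>[^\]]+)\])?"
)


def parse_resource_error(message):
    """Return the named fields of the error as a dict, or None if no match."""
    match = PATTERN.search(message)
    return match.groupdict() if match else None


info = parse_resource_error(ERROR)
for key, value in info.items():
    print(f"{key}: {value}")
```

Read this way, the message names a variable whose name carries `replica_1` while the device it reports is `GPU:0`, which may be a useful starting point when checking how variables are placed across the four GPUs.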

Describe the expected behavior
I expect to be able to compile my multi-GPU training code with XLA, but this does not seem to be supported.

Standalone code to reproduce the issue
https://github.com/sseung0703/TF2-multi-gpu-training
