Problem with distributed training under XLA compilation.

**System information**
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  - Yes: custom layers and a custom training step
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  - I have tested on Windows 10, Ubuntu 16.04, and Ubuntu 18.04.
- TensorFlow installed from (source or binary):
  - both
- TensorFlow version (use command below):
  - I have tried the distributed (binary) builds of TF 2.4 and 2.5, and a source build of TF 2.4
- Python version:
  - 3.7
- Bazel version (if compiling from source):
  - 3.5.0
- GCC/Compiler version (if compiling from source):
  - 7.5
- CUDA/cuDNN version:
  - 10.1 and 11.0
- GPU model and memory:
  - GTX 1080 Ti x4

**Describe the current behavior**
When I train my model on multiple GPUs with XLA compilation enabled, the following error occurs:
```
Training starts
Traceback (most recent call last):
  File "FFP_/train_w_pruning.py", line 76, in <module>
    train_step(*data)
  File "/home/cvip/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 787, in __call__
    result = self._call(*args, **kwds)
  File "/home/cvip/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 854, in _call
    filtered_flat_args, self._concrete_stateful_fn.captured_inputs)  # pylint: disable=protected-access
  File "/home/cvip/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1920, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/cvip/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 561, in call
    ctx=ctx)
  File "/home/cvip/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Trying to access resource ResNet/conv/kernel/replica_1_879 located in device /job:localhost/replica:0/task:0/device:GPU:0 [Op:__inference_train_step_dist_88943]
```
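For triage, it can help to pull the resource name, device, and op out of the `InvalidArgumentError` message so they can be compared across runs. The following is a minimal standard-library sketch, not part of the reproduction code. The regular expression assumes the message layout shown in the traceback above, and the helper name `parse_resource_error` is hypothetical. Reading a `replica_<n>` path component as a per-replica variable copy is also an assumption about the naming scheme.

```python
import re

# Message copied from the traceback above.
MESSAGE = (
    "Trying to access resource ResNet/conv/kernel/replica_1_879 located in "
    "device /job:localhost/replica:0/task:0/device:GPU:0 "
    "[Op:__inference_train_step_dist_88943]"
)

# Assumed layout of the message; other TensorFlow versions may word it differently.
PATTERN = re.compile(
    r"Trying to access resource (?P<resource>\S+) located in device "
    r"(?P<device>\S+)(?: \[Op:(?P<op>[^\]]+)\])?"
)


def parse_resource_error(message):
    """Return the resource, device, op, and any replica index found in the name."""
    match = PATTERN.search(message)
    if match is None:
        return None
    info = match.groupdict()
    # Assumption: a "replica_<n>" path component marks a per-replica variable copy.
    replica = re.search(r"/replica_(\d+)", info["resource"])
    info["resource_replica"] = int(replica.group(1)) if replica else None
    # The device string ends in "<TYPE>:<index>", e.g. "GPU:0".
    device = re.search(r"device:(\w+):(\d+)$", info["device"])
    info["device_type"] = device.group(1) if device else None
    info["device_index"] = int(device.group(2)) if device else None
    return info


if __name__ == "__main__":
    print(parse_resource_error(MESSAGE))
```

For the message above, the helper reports a resource whose name carries replica index 1 alongside a device string ending in `GPU:0`. Whether that pairing points to the cause of the failure is not established here.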

**Describe the expected behavior**
I want to XLA-compile my multi-GPU training code, but this does not seem to be supported.

**Standalone code to reproduce the issue**
https://github.com/sseung0703/TF2-multi-gpu-training


Problem about distributed training with XLA compiling. #45940
