Inconsistency in CPU results and GPU results in model training #67137
Labels
comp:gpu, stat:awaiting response, TF 2.13, type:bug
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
No
Source
binary
TensorFlow version
tf 2.13.0
Custom code
Yes
OS platform and distribution
Linux Ubuntu 20.04.5
Mobile device
No response
Python version
3.8.10
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
11.8/8.7
GPU model and memory
No response
Current behavior?
I am reporting an inconsistency between CPU and GPU results encountered during distributed training of a model across different types of devices with TensorFlow 2.13.0. I initially hit the bug with multiple GPUs involved, but it also reproduces in the single-GPU case.
It is very likely an edge case: the inconsistency only occurs with the specific initial weights and inputs we provide. To make the difference more apparent given a limited amount of training data, we deliberately chose a relatively high learning rate (lr=10.0).
Before executing the code, place the model in the same directory as the reproduction script so that the model weights can be loaded. Loading these weights is important, as randomly initialized weights do not reproduce the bug.
Standalone code to reproduce the issue
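The original reproduction script is not included in this section. Below is a minimal sketch of the setup described above: it loads the same saved weights on CPU and on GPU, trains for a few steps with a high learning rate, and compares the resulting weights. The file names `model.h5`, `inputs.npy`, and `labels.npy`, the loss, and the number of steps are placeholders, not the author's actual artifacts.

```python
import numpy as np
import tensorflow as tf


def train_on(device, model_path, x, y, steps=5, lr=10.0):
    """Load the saved weights and run a few SGD steps on the given device."""
    with tf.device(device):
        model = tf.keras.models.load_model(model_path)
        model.compile(
            optimizer=tf.keras.optimizers.SGD(learning_rate=lr),
            loss="sparse_categorical_crossentropy",
        )
        # Single full-batch updates keep the comparison deterministic
        # apart from the device-specific kernels under test.
        model.fit(x, y, epochs=steps, batch_size=len(x), verbose=0)
        return [w.numpy() for w in model.weights]


# Placeholder data files standing in for the inputs shipped with the issue.
x = np.load("inputs.npy")
y = np.load("labels.npy")

cpu_weights = train_on("/CPU:0", "model.h5", x, y)
gpu_weights = train_on("/GPU:0", "model.h5", x, y)

# Compare the trained weights produced on CPU vs. GPU.
for i, (wc, wg) in enumerate(zip(cpu_weights, gpu_weights)):
    print(f"weight {i}: max abs difference = {np.max(np.abs(wc - wg))}")
```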
Relevant log output