Set `reshuffle_each_iteration` in `Dataset.shuffle()` directly to `True` to avoid confusion #62782
Conversation
Thank you!
Test is failing because we need to update the golden files with the API change:

tensorflow/tensorflow/tools/api/golden/v1/tensorflow.data.-dataset.pbtxt (lines 190 to 193 in fb15a8c)
tensorflow/tensorflow/tools/api/golden/v2/tensorflow.data.-dataset.pbtxt (lines 157 to 160 in fb15a8c)

Please update them accordingly.
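For context, the golden API files record the argspec of every public method, so changing the default surfaces as a small diff roughly of the following shape (illustrative sketch only, not the exact file contents):

```
member_method {
  name: "shuffle"
  argspec: "args=['self', 'buffer_size', 'seed', 'reshuffle_each_iteration', 'name'], varargs=None, keywords=None, defaults=['None', 'True', 'None'], "
}
```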
Thanks for reviewing and for the comments @aaudiber. I wasn't quite aware of the golden API files, thanks for the input!

Hi @aaudiber, thanks for reviewing. Also, I noticed some tests are failing (ROCm/MacOS CPU and such) though they don't have a
…ctly to `True` to avoid confusion

Imported from GitHub PR #62782

## Summary

Set the default `reshuffle_each_iteration` value in the `tf.data.Dataset.shuffle` method directly to `True` (previously it was `None`, but was interpreted as `True`) to raise awareness of possible silent data leakage; see discussion #59279.

## Details (copied and cleaned up from #59279)

The `Dataset.shuffle` method might lead to **dangerously silent** data leakage.

In the [Dataset doc](https://www.tensorflow.org/api_docs/python/tf/data/Dataset), the `shuffle` method's signature reads:

> shuffle(
>     buffer_size, seed=None, reshuffle_each_iteration=None, name=None
> )

where `reshuffle_each_iteration=None` is **highly misleading**, as it can easily be misinterpreted to mean that reshuffling is off by default (**which is not true at all**).

The `shuffle` method calls `shuffle_op._shuffle`:
https://github.com/tensorflow/tensorflow/blob/4dacf3f368eb7965e9b5c3bbdd5193986081c3b2/tensorflow/python/data/ops/dataset_ops.py#L1472-L1473

which then calls `_ShuffleDataset`:
https://github.com/tensorflow/tensorflow/blob/b756c44e3f3ed52ccb4f05736569b95f4481eea0/tensorflow/python/data/ops/shuffle_op.py#L25-L32

which finally instantiates the `_ShuffleDataset` class:
https://github.com/tensorflow/tensorflow/blob/b756c44e3f3ed52ccb4f05736569b95f4481eea0/tensorflow/python/data/ops/shuffle_op.py#L35-L50

which has the following dangerous definition:

```
if reshuffle_each_iteration is None:
  reshuffle_each_iteration = True
```

As a result, the default `reshuffle_each_iteration = None` is interpreted as `reshuffle_each_iteration = True`, which is truly unexpected by the user.
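The failure mode can be sketched without TensorFlow at all. The toy `shuffled` generator below is a hypothetical stand-in for `Dataset.shuffle` (not the real implementation): it yields one permutation of the data per epoch and reshuffles between epochs when `reshuffle_each_iteration=True`, which is the effective behavior even when the argument is left as `None`. A `take`/`skip`-style split is then recomputed on a different permutation every epoch, so examples "held out" in one epoch can be trained on in the next.

```python
import random

def shuffled(data, seed=None, reshuffle_each_iteration=True):
    # Toy stand-in for Dataset.shuffle: each next() call yields one epoch
    # (a permutation of `data`), reshuffling between epochs when
    # reshuffle_each_iteration is True.
    rng = random.Random(seed)
    order = list(data)
    rng.shuffle(order)
    while True:
        yield list(order)
        if reshuffle_each_iteration:
            rng.shuffle(order)

data = list(range(10))
epochs = shuffled(data, seed=0, reshuffle_each_iteration=True)

# The leaky "shuffle + take/skip" pattern: the train/validation boundary
# is applied to the *reshuffled* stream, so the partition changes each epoch.
epoch1, epoch2 = next(epochs), next(epochs)
train1, val1 = set(epoch1[:8]), set(epoch1[8:])
train2, val2 = set(epoch2[:8]), set(epoch2[8:])

print("epoch-1 validation:", sorted(val1))
print("epoch-2 validation:", sorted(val2))
# Any element here was held out in epoch 1 but trained on in epoch 2:
print("leaked:", sorted(train2 & val1))
```

With `reshuffle_each_iteration=False` the same split is stable across epochs, which is what users misreading the `None` default tend to expect.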
## TODO list

- [x] Set the default value of `reshuffle_each_iteration` directly to `True`
- [x] Add a warning in the docs about possible data leakage related to `reshuffle_each_iteration = True`
- ~~Issue a warning when training/validation datasets are split using the `shuffle + take/skip` pattern~~ (scheduled for a separate follow-up PR)

Copybara import of the project:

-- ec9557c by Haoyu (Daniel) <yanghaoyu97@outlook.com>:
set `reshuffle_each_iteration` directly to `True`

-- efbc98e by Haoyu (Daniel) <yanghaoyu97@outlook.com>:
add warning in docstring

-- fcf84be by Haoyu (Daniel) <yanghaoyu97@outlook.com>:
revise wording

-- 134fc24 by Haoyu (Daniel) <yanghaoyu97@outlook.com>:
change default values in API golden files

Merging this change closes #62782

FUTURE_COPYBARA_INTEGRATE_REVIEW=#62782 from DanielYang59:warn-shuffle 134fc24
PiperOrigin-RevId: 615518669
eedefdd into tensorflow:master
Send-Recv pipelining and record the decision with frontend attributes.

We first use a simple heuristic to decide which CollectivePermute operations will be pipelined. We only pipeline a CollectivePermute that sends loop input data, and pick the first pipelineable CollectivePermute for pipelining. If another pipelineable CollectivePermute forms a cycle with the to-be-pipelined CollectivePermute, we pipeline both; otherwise, we pipeline only the one CollectivePermute. Then, when we decompose CollectivePermute operations, we add a frontend attribute to the Send/Recv operations to represent the pipelining decision.

Add tests.

FUTURE_COPYBARA_INTEGRATE_REVIEW=#62782 from DanielYang59:warn-shuffle 134fc24
PiperOrigin-RevId: 614419535