Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set reshuffle_each_iteration in Dataset.shuffle() directly to True to avoid confusion #62782

Merged
merged 4 commits into from Mar 13, 2024

Conversation

DanielYang59
Copy link
Contributor

@DanielYang59 DanielYang59 commented Jan 11, 2024

Summary

Set the default reshuffle_each_iteration value in tf.data.Dataset.shuffle method to True directly (previous was None but was interpreted to True) to raise awareness of possible silent data leakage, see discussion #59279.

Details (copied and cleaned up from #59279)

The Dataset.shuffle method might lead to dangerously silent data leakage:

In the Dataset doc under the shuffle method, it reads:

shuffle(
buffer_size, seed=None, reshuffle_each_iteration=None, name=None
)

where the reshuffle_each_iteration=None is super misleading, as it be easily misinterpreted that reshuffled is off by default (which is not true at all).

The shuffle method, which called shuffle_op._shuffle:

return shuffle_op._shuffle( # pylint: disable=protected-access
self, buffer_size, seed, reshuffle_each_iteration, name=name)

which then called _ShuffleDataset:

def _shuffle( # pylint: disable=unused-private-name
input_dataset,
buffer_size,
seed=None,
reshuffle_each_iteration=None,
name=None):
return _ShuffleDataset(
input_dataset, buffer_size, seed, reshuffle_each_iteration, name=name)

which finally inititate the _ShuffleDataset class:

class _ShuffleDataset(dataset_ops.UnaryUnchangedStructureDataset):
"""A `Dataset` that randomly shuffles the elements of its input."""
def __init__(self,
input_dataset,
buffer_size,
seed=None,
reshuffle_each_iteration=None,
name=None):
"""See `Dataset.shuffle()` for details."""
self._input_dataset = input_dataset
self._buffer_size = ops.convert_to_tensor(
buffer_size, dtype=dtypes.int64, name="buffer_size")
self._seed, self._seed2 = random_seed.get_seed(seed)
if reshuffle_each_iteration is None:
reshuffle_each_iteration = True

has the following dangerous definition:

if reshuffle_each_iteration is None:
      reshuffle_each_iteration = True

As a result, the default reshuffle_each_iteration = None would be interpreted to reshuffle_each_iteration = True (which is truly unexpected by user).

TODO list:

  • Set the default value of reshuffle_each_iteration directly to True
  • Add warning in docs about possible data leakage related to reshuffle_each_iteration = True
    Issue warning when training/validation datasets are split by using the shuffle + take/skip pattern (scheduled for a separate follow-up PR)

@google-ml-butler google-ml-butler bot added the size:XS CL Change Size: Extra Small label Jan 11, 2024
@DanielYang59 DanielYang59 changed the title Set reshuffle_each_iteration in Dataset.shuffle() directly to True Set reshuffle_each_iteration in Dataset.shuffle() directly to True to avoid possible data leakage Jan 11, 2024
@DanielYang59 DanielYang59 marked this pull request as ready for review January 12, 2024 01:32
@DanielYang59
Copy link
Contributor Author

DanielYang59 commented Jan 12, 2024

I think I might keep the 3rd task to a separate PR, as I'm still working on a proper way to track if dataset has undergone shuffle(reshuffle_each_iteration=True) before take and skip. And this PR would not change any behaviour as far as I know. Can you please review? @aaudiber @gbaned

@gbaned gbaned added the comp:data tf.data related issues label Jan 12, 2024
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation Jan 12, 2024
@google-ml-butler google-ml-butler bot added the awaiting review Pull request awaiting review label Jan 12, 2024
Copy link
Contributor

@aaudiber aaudiber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

PR Queue automation moved this from Assigned Reviewer to Approved by Reviewer Jan 12, 2024
@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Jan 12, 2024
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Jan 12, 2024
@aaudiber
Copy link
Contributor

Test is failing because we need to update the golden files with the API change:

member_method {
name: "shuffle"
argspec: "args=[\'self\', \'buffer_size\', \'seed\', \'reshuffle_each_iteration\', \'name\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\'], "
}

member_method {
name: "shuffle"
argspec: "args=[\'self\', \'buffer_size\', \'seed\', \'reshuffle_each_iteration\', \'name\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\'], "
}

Please updates the Nones to True there as well

@google-ml-butler google-ml-butler bot removed the ready to pull PR ready for merge process label Jan 13, 2024
@DanielYang59
Copy link
Contributor Author

Thanks for reviewing and the comments @aaudiber . I'm not quite aware of the API and thanks for the input!

@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Jan 16, 2024
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Jan 16, 2024
@DanielYang59
Copy link
Contributor Author

Hi @aaudiber. Thanks for reviewing. Also I noticed some tests are failing (ROCm/MacOS CPU and such) though they don't have a Required tag. Should I be concerned?

@DanielYang59
Copy link
Contributor Author

This PR has been approved and taged ready to pull for over a month now. Is there anything I should do? Thanks! @gbaned @aaudiber

@gbaned gbaned removed awaiting review Pull request awaiting review ready to pull PR ready for merge process labels Feb 20, 2024
@gbaned gbaned added the ready to pull PR ready for merge process label Feb 20, 2024
@DanielYang59 DanielYang59 changed the title Set reshuffle_each_iteration in Dataset.shuffle() directly to True to avoid possible data leakage Set reshuffle_each_iteration in Dataset.shuffle() directly to True to avoid confusion Mar 7, 2024
copybara-service bot pushed a commit that referenced this pull request Mar 13, 2024
FUTURE_COPYBARA_INTEGRATE_REVIEW=#62782 from DanielYang59:warn-shuffle 134fc24
PiperOrigin-RevId: 615218481
copybara-service bot pushed a commit that referenced this pull request Mar 13, 2024
This thunk wraps the logic to compute dynamic offsets/sizes from dynamic-slice and DUS around some original thunks (e.g. custom call or NCCL thunks)

FUTURE_COPYBARA_INTEGRATE_REVIEW=#62782 from DanielYang59:warn-shuffle 134fc24
PiperOrigin-RevId: 615492753
copybara-service bot pushed a commit that referenced this pull request Mar 13, 2024
…nment file.

FUTURE_COPYBARA_INTEGRATE_REVIEW=#62782 from DanielYang59:warn-shuffle 134fc24
PiperOrigin-RevId: 614792083
copybara-service bot pushed a commit that referenced this pull request Mar 13, 2024
…ctly to `True` to avoid confusion

Imported from GitHub PR #62782

## Summary

Set the default `reshuffle_each_iteration` value in `tf.data.Dataset.shuffle` method to `True` directly (previous was `None` but was interpreted to `True`) to raise awareness of possible silent data leakage, see discussion #59279.

## Details (copied and cleaned up from #59279)

The `Dataset.shuffle` method might lead to **dangerously silent** data leakage:

In the [Dataset doc](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) under the `shuffle` method, it reads:

> shuffle(
>     buffer_size, seed=None, reshuffle_each_iteration=None, name=None
> )

where the `reshuffle_each_iteration=None` is **super misleading**, as it be easily misinterpreted that reshuffled is off by default (**which is not true at all**).

The `shuffle` method, which called `shuffle_op._shuffle`:
https://github.com/tensorflow/tensorflow/blob/4dacf3f368eb7965e9b5c3bbdd5193986081c3b2/tensorflow/python/data/ops/dataset_ops.py#L1472-L1473

which then called `_ShuffleDataset`:
  https://github.com/tensorflow/tensorflow/blob/b756c44e3f3ed52ccb4f05736569b95f4481eea0/tensorflow/python/data/ops/shuffle_op.py#L25-L32

which finally inititate the `_ShuffleDataset` class:
https://github.com/tensorflow/tensorflow/blob/b756c44e3f3ed52ccb4f05736569b95f4481eea0/tensorflow/python/data/ops/shuffle_op.py#L35-L50

has the following dangerous definition:
```
if reshuffle_each_iteration is None:
      reshuffle_each_iteration = True
```

As a result, the default `reshuffle_each_iteration = None` would be interpreted to `reshuffle_each_iteration = True` (which is truly unexpected by user).

## TODO list:
- [X] Set the default value of `reshuffle_each_iteration` directly to `True`
- [x] Add warning in docs about possible data leakage related to `reshuffle_each_iteration = True`
~~Issue warning when training/validation datasets are split by using the `shuffle + take/skip` pattern~~ (scheduled for a separate follow-up PR)

Copybara import of the project:

--
ec9557c by Haoyu (Daniel) <yanghaoyu97@outlook.com>:

set `reshuffle_each_iteration` directly to `True`

--
efbc98e by Haoyu (Daniel) <yanghaoyu97@outlook.com>:

add warning in docstring

--
fcf84be by Haoyu (Daniel) <yanghaoyu97@outlook.com>:

revise wording

--
134fc24 by Haoyu (Daniel) <yanghaoyu97@outlook.com>:

change default values in API golden files

Merging this change closes #62782

FUTURE_COPYBARA_INTEGRATE_REVIEW=#62782 from DanielYang59:warn-shuffle 134fc24
PiperOrigin-RevId: 615518669
@copybara-service copybara-service bot merged commit eedefdd into tensorflow:master Mar 13, 2024
9 of 13 checks passed
PR Queue automation moved this from Approved by Reviewer to Merged Mar 13, 2024
copybara-service bot pushed a commit that referenced this pull request Mar 13, 2024
Send-Recv pipeling and record the decision with frontend attributes.

We first use a simple heuristics to decide on the decomposition of which
CollectivePermute operations will be pipelined. We will only pipeline
CollectivePermute that sends loop input data, and pick the first
pipelineable CollectivePermute for pipelining. Then, if there is another
pipelineable CollectivePermute that forms a cycle with the to-be-pipelined
CollectivePermute, we will pipeline both CollectivePermute. Otherwise, we will
only pipeline one CollectivePermute.

Then, when we decompose CollectivePermute operations, we add a frontend
attribute to the Send/Recv operation to represent the pipelining decision.

Add tests.

FUTURE_COPYBARA_INTEGRATE_REVIEW=#62782 from DanielYang59:warn-shuffle 134fc24
PiperOrigin-RevId: 614419535
copybara-service bot pushed a commit that referenced this pull request Mar 13, 2024
…ments.

∘ :: (b -> c) -> (a -> b) -> (a -> c)

FUTURE_COPYBARA_INTEGRATE_REVIEW=#62782 from DanielYang59:warn-shuffle 134fc24
PiperOrigin-RevId: 615527130
@DanielYang59 DanielYang59 deleted the warn-shuffle branch March 14, 2024 01:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comp:data tf.data related issues ready to pull PR ready for merge process size:XS CL Change Size: Extra Small
Projects
PR Queue
  
Merged
Development

Successfully merging this pull request may close these issues.

None yet

4 participants