Set `reshuffle_each_iteration` in `Dataset.shuffle()` directly to `True` to avoid confusion #62782

DanielYang59 · 2024-01-11T11:22:51Z

Summary

Set the default value of reshuffle_each_iteration in the tf.data.Dataset.shuffle method to True directly (previously it was None, which was interpreted as True) to raise awareness of possible silent data leakage; see discussion #59279.

Details (copied and cleaned up from #59279)

The Dataset.shuffle method might lead to dangerously silent data leakage:

In the Dataset doc under the shuffle method, it reads:

shuffle(
buffer_size, seed=None, reshuffle_each_iteration=None, name=None
)

where reshuffle_each_iteration=None is very misleading, as it can easily be misread as meaning that reshuffling is off by default (which is not true at all).

The shuffle method calls shuffle_op._shuffle:

tensorflow/tensorflow/python/data/ops/dataset_ops.py

Lines 1472 to 1473 in 4dacf3f

    return shuffle_op._shuffle(  # pylint: disable=protected-access
        self, buffer_size, seed, reshuffle_each_iteration, name=name)

which in turn calls _ShuffleDataset:

tensorflow/tensorflow/python/data/ops/shuffle_op.py

Lines 25 to 32 in b756c44

    def _shuffle(  # pylint: disable=unused-private-name
        input_dataset,
        buffer_size,
        seed=None,
        reshuffle_each_iteration=None,
        name=None):
      return _ShuffleDataset(
          input_dataset, buffer_size, seed, reshuffle_each_iteration, name=name)

which finally instantiates the _ShuffleDataset class:

tensorflow/tensorflow/python/data/ops/shuffle_op.py

Lines 35 to 50 in b756c44

    class _ShuffleDataset(dataset_ops.UnaryUnchangedStructureDataset):
      """A `Dataset` that randomly shuffles the elements of its input."""

      def __init__(self,
                   input_dataset,
                   buffer_size,
                   seed=None,
                   reshuffle_each_iteration=None,
                   name=None):
        """See `Dataset.shuffle()` for details."""
        self._input_dataset = input_dataset
        self._buffer_size = ops.convert_to_tensor(
            buffer_size, dtype=dtypes.int64, name="buffer_size")
        self._seed, self._seed2 = random_seed.get_seed(seed)
        if reshuffle_each_iteration is None:
          reshuffle_each_iteration = True

The constructor contains the following dangerous definition:

    if reshuffle_each_iteration is None:
      reshuffle_each_iteration = True

As a result, the default reshuffle_each_iteration = None is interpreted as reshuffle_each_iteration = True, which users are unlikely to expect.
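To make the mismatch concrete, here is a minimal plain-Python sketch. It does not use TensorFlow: `shuffle_before`, `shuffle_after`, and `documented_default` are hypothetical stand-ins that only mirror the shape of the signatures quoted above. It shows that both signatures behave identically when the argument is omitted, while only the second one tells the reader so.

```python
import inspect


def shuffle_before(buffer_size, seed=None, reshuffle_each_iteration=None, name=None):
    """Hypothetical stand-in for the old signature: None is silently mapped to True."""
    if reshuffle_each_iteration is None:
        reshuffle_each_iteration = True
    return reshuffle_each_iteration


def shuffle_after(buffer_size, seed=None, reshuffle_each_iteration=True, name=None):
    """Hypothetical stand-in for the new signature: the default is stated directly."""
    return reshuffle_each_iteration


def documented_default(func):
    """Return the default a reader sees in the signature or in generated docs."""
    return inspect.signature(func).parameters["reshuffle_each_iteration"].default


# Both versions behave the same when the argument is omitted ...
assert shuffle_before(10) is True and shuffle_after(10) is True

# ... but only the new signature says so.
print(documented_default(shuffle_before))  # None
print(documented_default(shuffle_after))   # True
```

Passing `reshuffle_each_iteration=False` explicitly still works the same way in both stand-ins, which matches the claim in this PR that no behaviour changes.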

TODO list:

- [x] Set the default value of reshuffle_each_iteration directly to True
- [x] Add a warning in the docs about possible data leakage related to reshuffle_each_iteration = True
- ~~Issue a warning when training/validation datasets are split by using the shuffle + take/skip pattern~~ (scheduled for a separate follow-up PR)
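The leakage that the shuffle + take/skip pattern can cause is easy to reproduce in a plain-Python simulation. The sketch below is not tf.data: `ReshufflingDataset`, `take`, `skip`, and `leaked_validation_items` are hypothetical helpers that only mimic a dataset producing a new order on every pass. Because the training and validation splits are read in separate passes, each split is cut from a different permutation, so the same items can land in both.

```python
import random


class ReshufflingDataset:
    """Hypothetical stand-in for a dataset that reshuffles on every iteration.

    This is not tf.data; it only mimics the "new order per pass" behaviour.
    """

    def __init__(self, items, seed=0):
        self._items = list(items)
        self._rng = random.Random(seed)

    def __iter__(self):
        order = self._items[:]
        self._rng.shuffle(order)  # a fresh permutation for every pass
        return iter(order)


def take(dataset, n):
    """First n elements of one pass over the dataset."""
    return [x for i, x in enumerate(dataset) if i < n]


def skip(dataset, n):
    """Everything after the first n elements of one pass over the dataset."""
    return [x for i, x in enumerate(dataset) if i >= n]


def leaked_validation_items(dataset, split, epochs):
    """Collect validation items that were also seen in training in some epoch."""
    seen_in_training, validation = set(), set()
    for _ in range(epochs):
        seen_in_training.update(take(dataset, split))  # one pass
        validation.update(skip(dataset, split))        # another pass, new order
    return seen_in_training & validation


reshuffled = ReshufflingDataset(range(100))
leaked = leaked_validation_items(reshuffled, split=80, epochs=5)
print(f"{len(leaked)} of 100 items appeared in both splits")

# Control: a fixed order (here a plain list) keeps the two splits disjoint.
fixed_order = list(range(100))
print(leaked_validation_items(fixed_order, split=80, epochs=5))  # set()
```

The control case with a fixed order corresponds to shuffling once and never reshuffling, which is what a user reading `reshuffle_each_iteration=None` might have assumed was the default.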

DanielYang59 · 2024-01-12T01:32:53Z

I think I'll leave the third task for a separate PR, as I'm still working on a proper way to track whether a dataset has undergone shuffle(reshuffle_each_iteration=True) before take and skip. As far as I know, this PR does not change any behaviour. Could you please review? @aaudiber @gbaned

aaudiber

Thank you!

aaudiber · 2024-01-12T16:24:05Z

Test is failing because we need to update the golden files with the API change:

tensorflow/tensorflow/tools/api/golden/v1/tensorflow.data.-dataset.pbtxt

Lines 190 to 193 in fb15a8c

    member_method {
      name: "shuffle"
      argspec: "args=[\'self\', \'buffer_size\', \'seed\', \'reshuffle_each_iteration\', \'name\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\'], "
    }

tensorflow/tensorflow/tools/api/golden/v2/tensorflow.data.-dataset.pbtxt

Lines 157 to 160 in fb15a8c

    member_method {
      name: "shuffle"
      argspec: "args=[\'self\', \'buffer_size\', \'seed\', \'reshuffle_each_iteration\', \'name\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\'], "
    }

Please update the None for reshuffle_each_iteration to True there as well.
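As a rough illustration of which entry in the `defaults` list is affected, the sketch below inspects a hypothetical stand-in function with the new signature using the standard-library `inspect` module. It is not the tool that generates the golden files; it only shows that defaults align with the trailing arguments, so `seed` and `name` keep `None` and only the middle entry becomes `True`.

```python
import inspect


def shuffle(self, buffer_size, seed=None, reshuffle_each_iteration=True, name=None):
    """Hypothetical stand-in with the new default; the body is irrelevant here."""


spec = inspect.getfullargspec(shuffle)
print(spec.args)      # ['self', 'buffer_size', 'seed', 'reshuffle_each_iteration', 'name']
print(spec.defaults)  # (None, True, None)

# Defaults belong to the last len(defaults) arguments, in order.
defaults_by_name = dict(zip(spec.args[-len(spec.defaults):], spec.defaults))
print(defaults_by_name)
```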

DanielYang59 · 2024-01-13T02:19:08Z

Thanks for the review and the comments, @aaudiber. I'm not very familiar with the API, so thanks for the input!

DanielYang59 · 2024-01-17T01:56:32Z

Hi @aaudiber. Thanks for reviewing. I also noticed that some tests are failing (ROCm, macOS CPU, and such), though they don't have a Required tag. Should I be concerned?

DanielYang59 · 2024-02-19T09:04:31Z

This PR has been approved and tagged ready to pull for over a month now. Is there anything I should do? Thanks! @gbaned @aaudiber


Imported from GitHub PR #62782

Copybara import of the project:

-- ec9557c by Haoyu (Daniel) <yanghaoyu97@outlook.com>: set `reshuffle_each_iteration` directly to `True`
-- efbc98e by Haoyu (Daniel) <yanghaoyu97@outlook.com>: add warning in docstring
-- fcf84be by Haoyu (Daniel) <yanghaoyu97@outlook.com>: revise wording
-- 134fc24 by Haoyu (Daniel) <yanghaoyu97@outlook.com>: change default values in API golden files

Merging this change closes #62782

PiperOrigin-RevId: 615518669


set reshuffle_each_iteration directly to True

ec9557c

google-ml-butler bot added the size:XS CL Change Size: Extra Small label Jan 11, 2024

google-ml-butler bot assigned gbaned Jan 11, 2024

DanielYang59 added 2 commits January 11, 2024 19:57

add warning in docstring

efbc98e

revise wording

fcf84be

DanielYang59 changed the title ~~Set reshuffle_each_iteration in Dataset.shuffle() directly to True~~ Set reshuffle_each_iteration in Dataset.shuffle() directly to True to avoid possible data leakage Jan 11, 2024

DanielYang59 marked this pull request as ready for review January 12, 2024 01:32

gbaned added the comp:data tf.data related issues label Jan 12, 2024

gbaned added this to Assigned Reviewer in PR Queue via automation Jan 12, 2024

gbaned requested a review from wilsingosti January 12, 2024 04:13

google-ml-butler bot added the awaiting review Pull request awaiting review label Jan 12, 2024

aaudiber approved these changes Jan 12, 2024


PR Queue automation moved this from Assigned Reviewer to Approved by Reviewer Jan 12, 2024

google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Jan 12, 2024

kokoro-team removed the kokoro:force-run Tests on submitted change label Jan 12, 2024

change default values in API golden files

134fc24

google-ml-butler bot removed the ready to pull PR ready for merge process label Jan 13, 2024

DanielYang59 requested a review from aaudiber January 13, 2024 02:20

aaudiber approved these changes Jan 16, 2024


google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Jan 16, 2024

kokoro-team removed the kokoro:force-run Tests on submitted change label Jan 16, 2024

gbaned removed awaiting review Pull request awaiting review ready to pull PR ready for merge process labels Feb 20, 2024

gbaned added the ready to pull PR ready for merge process label Feb 20, 2024

DanielYang59 changed the title ~~Set reshuffle_each_iteration in Dataset.shuffle() directly to True to avoid possible data leakage~~ Set reshuffle_each_iteration in Dataset.shuffle() directly to True to avoid confusion Mar 7, 2024

copybara-service bot pushed a commit that referenced this pull request Mar 13, 2024

#tf-data Support global shuffle for the Tensor slices dataset.

a0c45e2

FUTURE_COPYBARA_INTEGRATE_REVIEW=#62782 from DanielYang59:warn-shuffle 134fc24 PiperOrigin-RevId: 615218481

copybara-service bot mentioned this pull request Mar 13, 2024

#tf-data Support global shuffle for the Tensor slices dataset. #63569

Merged

copybara-service bot mentioned this pull request Mar 13, 2024

[xla:gpu] Add AddressComputationThunk #63639

Merged

copybara-service bot pushed a commit that referenced this pull request Mar 13, 2024

[XLA:MSA] Refactor MemoryBoundLoopOptimizer out of memory_space_assig…

8f67f1d

…nment file. FUTURE_COPYBARA_INTEGRATE_REVIEW=#62782 from DanielYang59:warn-shuffle 134fc24 PiperOrigin-RevId: 614792083

copybara-service bot mentioned this pull request Mar 13, 2024

[XLA:MSA] Refactor MemoryBoundLoopOptimizer out of memory_space_assignment file. #63456

Merged

copybara-service bot mentioned this pull request Mar 13, 2024

PR #62782: Set reshuffle_each_iteration in Dataset.shuffle() directly to True to avoid confusion #63651

Closed


copybara-service bot merged commit eedefdd into tensorflow:master Mar 13, 2024
9 of 13 checks passed

PR Queue automation moved this from Approved by Reviewer to Merged Mar 13, 2024

copybara-service bot mentioned this pull request Mar 13, 2024

[xla] Extend collective-permute decomposer to also make decision for #63390

Merged

copybara-service bot mentioned this pull request Mar 13, 2024

[XLA:GPU][TileAnalysis][NFC] Fix order of function composition in comments. #63653

Merged

DanielYang59 deleted the warn-shuffle branch March 14, 2024 01:54
