[tf.data] graduate rejection_resample API from experimental to tf.data.Dataset #48894

kvignesh1420 · 2021-05-04T14:12:09Z

This PR graduates the tf.data.experimental.rejection_resample API into tf.data.Dataset.rejection_resample by making the following changes:

Adds a deprecation decorator to the experimental API.
Adds the rejection_resample() method to the DatasetV2 class.
Updates the documentation example to use the new API.
Regenerates the golden APIs.
Moves the rejection_resample_test target from experimental/kernel_tests to kernel_tests and updates it.
Updates the RELEASE.md file.
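
For reference, here is a minimal sketch of what calling the graduated method might look like, adapted from the examples later in this thread. The helper names resample_by_parity, class_counts, and main are illustrative only and not part of the PR. The sketch assumes a TensorFlow build that already includes this change, and that the method keeps the class_func, target_dist, initial_dist, and seed arguments of the experimental API.

```python
import collections


def resample_by_parity(dataset, target_dist, initial_dist=None, seed=None):
    """Resample `dataset` so that even/odd elements follow `target_dist`."""
    # Before this PR the equivalent call would have been:
    #   dataset.apply(tf.data.experimental.rejection_resample(...))
    return dataset.rejection_resample(
        class_func=lambda x: x % 2,
        target_dist=target_dist,
        initial_dist=initial_dist,
        seed=seed)


def class_counts(resampled):
    """Count classes; the resampled dataset yields (class, element) pairs."""
    counts = collections.Counter()
    for cls, _ in resampled:
        counts[int(cls)] += 1
    return counts


def main():
    # Imported here so the helpers above can be exercised without TensorFlow.
    import tensorflow as tf  # assumed: a build that contains this PR

    dataset = tf.data.Dataset.range(10000)
    resampled = resample_by_parity(
        dataset, target_dist=[0.6, 0.4], initial_dist=[0.5, 0.5])
    print("Resampled distribution: {}".format(dict(class_counts(resampled))))
```

Calling main() in an environment with a suitable TensorFlow build prints the per-class counts of the resampled dataset.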

TEST LOG

INFO: Build completed successfully, 8120 total actions
//tensorflow/python/data/kernel_tests:rejection_resample_test            PASSED in 5.1s
  Stats over 10 runs: max = 5.1s, min = 2.6s, avg = 3.6s, dev = 0.9s

kvignesh1420 · 2021-05-04T17:30:48Z

In [81]: from collections import defaultdict
    ...: import numpy as np
    ...: import tensorflow as tf
    ...:
    ...: init_dist = [0.6 , 0.4]
    ...: target_dist = [0.5, 0.5]
    ...: num_classes = len(init_dist)
    ...: num_samples = 10000
    ...: data_np = np.random.choice(num_classes, num_samples, p=init_dist)
    ...: dataset = tf.data.Dataset.from_tensor_slices(data_np)
    ...: vals = defaultdict(int)
    ...: for i in dataset:
    ...:   vals[i.numpy()]+=1
    ...: print("Initial distribution: {}".format(vals))
Initial distribution: defaultdict(<class 'int'>, {1: 4040, 0: 5960})

In [82]: resampler = tf.data.experimental.rejection_resample(
    ...:             class_func=lambda x: x,
    ...:             target_dist=target_dist,
    ...:             initial_dist=init_dist)
    ...:
    ...: dataset = dataset.apply(resampler)
    ...:
    ...: from collections import defaultdict
    ...: vals = defaultdict(int)
    ...: for i in dataset:
    ...:   vals[i[-1].numpy()]+=1
    ...: print("Resampled distribution: {}".format(vals))
Proportion of examples rejected by sampler is high: [0.6][0.6 0.4][0 1]
Proportion of examples rejected by sampler is high: [0.6][0.6 0.4][0 1]
Proportion of examples rejected by sampler is high: [0.6][0.6 0.4][0 1]
Proportion of examples rejected by sampler is high: [0.6][0.6 0.4][0 1]
Proportion of examples rejected by sampler is high: [0.6][0.6 0.4][0 1]
Proportion of examples rejected by sampler is high: [0.6][0.6 0.4][0 1]
Proportion of examples rejected by sampler is high: [0.6][0.6 0.4][0 1]
Proportion of examples rejected by sampler is high: [0.6][0.6 0.4][0 1]
Proportion of examples rejected by sampler is high: [0.6][0.6 0.4][0 1]
Proportion of examples rejected by sampler is high: [0.6][0.6 0.4][0 1]
Resampled distribution: defaultdict(<class 'int'>, {1: 8080, 0: 5960})

cc: @jsimsa I was playing around with the API and observed some weird behavior. In the above example, it seems like elements are being added to the dataset instead of being removed.

jsimsa · 2021-05-05T20:54:36Z

The dataset produced by rejection_resample internally uses sample_from_datasets to sample from a) the original dataset and b) a filtered instance of the original dataset (code).

In other words, it is not unexpected that the number of output elements is greater than the cardinality of the original input dataset, but it is unexpected that the distribution does not match the target distribution. This is related to an issue we recently fixed in sample_from_datasets, where the transformation would stop respecting the sampling distribution once some of the input datasets became empty. I will send a follow-up CL that updates the rejection_resample implementation to use the new sample_from_datasets argument, which will fix the case where the distribution of the resampled dataset fails to match the target distribution (when most or all elements of the resampled dataset are consumed).
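
To make the mixing of the original and filtered datasets concrete, here is a pure-Python sketch of one way the per-class acceptance probabilities and the mixing weight could be derived. The formula and the function names are assumptions for illustration, not a copy of the TensorFlow implementation; the result does appear consistent with the trailing [0 1] vector in the warning printed above for initial_dist = [0.6, 0.4] and target_dist = [0.5, 0.5].

```python
def acceptance_probs_with_mixing(initial_dist, target_dist):
    """Return (per-class acceptance probs, weight of the unfiltered stream).

    Assumes every entry of `initial_dist` is strictly positive.
    """
    ratios = [t / p for t, p in zip(target_dist, initial_dist)]
    lo, hi = min(ratios), max(ratios)
    if hi == lo:
        # Distributions already match: sample only from the original stream.
        return [1.0] * len(ratios), 1.0
    accept = [(r - lo) / (hi - lo) for r in ratios]
    return accept, lo


def mixed_distribution(initial_dist, accept, weight_original):
    """Class distribution obtained by mixing original and filtered streams."""
    kept = [p * a for p, a in zip(initial_dist, accept)]
    total = sum(kept)
    filtered = [k / total for k in kept] if total else [0.0] * len(kept)
    return [weight_original * p + (1.0 - weight_original) * f
            for p, f in zip(initial_dist, filtered)]


accept, weight = acceptance_probs_with_mixing([0.6, 0.4], [0.5, 0.5])
mixed = mixed_distribution([0.6, 0.4], accept, weight)
```

Under these assumptions the filtered stream keeps only class 1, and mixing it with the original stream at weight 5/6 recovers the [0.5, 0.5] target, provided neither stream runs dry before the other.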

@yangustc07 FYI

With my fix patched, the following program (which I adapted from your example):

import collections
import tensorflow as tf

init_dist = [0.5, 0.5]
target_dist = [0.6, 0.4]
dataset = tf.data.Dataset.range(100000)
x = collections.defaultdict(int)
for i in dataset:
  x[i.numpy() % 2] += 1
print("Initial distribution: {}".format(x))

resampler = tf.data.experimental.rejection_resample(
    class_func=lambda x: x % 2,
    target_dist=target_dist,
    initial_dist=init_dist)
dataset = dataset.apply(resampler)

y = collections.defaultdict(int)
for i in dataset:
  cls, _ = i
  y[cls.numpy()] += 1
print("Resampled distribution: {}".format(y))

Produces the following output:

Initial distribution: defaultdict(<class 'int'>, {0: 50000, 1: 50000})
Proportion of examples rejected by sampler is high: [0.5][0.5 0.5][1 0]
Proportion of examples rejected by sampler is high: [0.5][0.5 0.5][1 0]
Proportion of examples rejected by sampler is high: [0.5][0.5 0.5][1 0]
Proportion of examples rejected by sampler is high: [0.5][0.5 0.5][1 0]
Proportion of examples rejected by sampler is high: [0.5][0.5 0.5][1 0]
Proportion of examples rejected by sampler is high: [0.5][0.5 0.5][1 0]
Proportion of examples rejected by sampler is high: [0.5][0.5 0.5][1 0]
Proportion of examples rejected by sampler is high: [0.5][0.5 0.5][1 0]
Proportion of examples rejected by sampler is high: [0.5][0.5 0.5][1 0]
Proportion of examples rejected by sampler is high: [0.5][0.5 0.5][1 0]
Resampled distribution: defaultdict(<class 'int'>, {0: 75089, 1: 50000})
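
The difference between the two behaviours can be reproduced without TensorFlow. The sketch below is a toy model rather than the actual sample_from_datasets code: mix_streams and its stop_on_empty flag are hypothetical names, and the model simply draws from two Python lists with fixed weights, either draining whatever stream is left or stopping as soon as a chosen stream is exhausted.

```python
import random


def mix_streams(original, filtered, weight_original, stop_on_empty, seed=0):
    """Sample from two streams with fixed weights until they run out."""
    rng = random.Random(seed)
    streams = [iter(original), iter(filtered)]
    weights = [weight_original, 1.0 - weight_original]
    out = []
    while any(w > 0 for w in weights):
        idx = rng.choices([0, 1], weights=weights)[0]
        try:
            out.append(next(streams[idx]))
        except StopIteration:
            if stop_on_empty:
                break  # stop sampling once a chosen stream is empty
            weights[idx] = 0.0  # otherwise keep draining the other stream
    return out


# Classes 0/1 at 50/50; the filtered stream keeps only class 0.
original = [i % 2 for i in range(10000)]
filtered = [c for c in original if c == 0]

drained = mix_streams(original, filtered, 0.8, stop_on_empty=False)
stopped = mix_streams(original, filtered, 0.8, stop_on_empty=True)
```

In this model, draining emits every element of both streams, so the class counts no longer depend on the weights at all. With 5,960 zeros and 4,040 ones in the original stream and a filtered stream holding the same 4,040 ones, it yields exactly 5,960 zeros and 8,080 ones, the counts reported earlier in the thread. Stopping on the first empty stream keeps the class-0 fraction close to the 0.6 target.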

kvignesh1420 · 2021-05-08T09:48:07Z

PR #49009 has been raised as a prerequisite for the current one.

gbaned · 2021-06-25T17:46:27Z

@kvignesh1420 Can you please resolve conflicts? Thanks!

kvignesh1420 · 2021-06-28T19:20:43Z

@gbaned the file changes in this PR conflict with those in the prerequisite PRs. I will resolve all the conflicts in this PR once the prerequisites are merged. Hope that's fine.

gbaned · 2021-07-08T11:13:16Z

@kvignesh1420 Can you please resolve conflicts? Thanks!

jsimsa · 2021-07-29T15:25:29Z

@aaudiber could you please take a look? thanks

aaudiber

Thanks @kvignesh1420

aaudiber · 2021-08-03T00:20:57Z

tensorflow/python/data/ops/dataset_ops.py

+      A `Dataset`
+    """
+
+    target_dist_t = ops.convert_to_tensor(target_dist, name="target_dist")


Moving all of our implementations into dataset_ops.py is making dataset_ops.py quite long and hard to navigate. In later PRs we should consider moving the dataset transformations from dataset_ops.py into their own files, similar to what we do for experimental ops. It would also make graduating experimental ops more straightforward. @jsimsa what are your thoughts?

jsimsa · 2021-08-03

This is a valid concern and I think it would be a useful refactor. That said, we should do this in a manner that does not require updating call sites that import dataset_ops. One option would be to keep dataset_ops.py as a shim that imports symbols from the "per transformation" modules.

kvignesh1420 · 2021-08-03

@aaudiber @jsimsa we might also want to consider circular dependencies and file duplication during the promotion/refactor process if we are going to maintain separate files per transformation and use them in dataset_ops.py. Also, if we use dataset_ops.py as a shim so that the API layout is unaffected for users, we might have to move the actual functionality of the APIs in dataset_ops.py into a new file so that circular dependencies can be prevented. WDYT?

aaudiber · 2021-08-03

+1 for making dataset_ops a shim. The refactor should have no user-facing impact. We can use LazyLoader in the dataset impl files to handle the circular reference on dataset_ops.py. Keeping dataset_ops readable is much more important than avoiding circular dependencies between dataset implementations and dataset_ops.

kvignesh1420 · 2021-08-03

Okay, sounds good!

google-ml-butler bot added the size:L CL Change Size: Large label May 4, 2021

google-cla bot added the cla: yes label May 4, 2021

gbaned self-assigned this May 5, 2021

gbaned added the comp:data tf.data related issues label May 5, 2021

gbaned added this to Assigned Reviewer in PR Queue via automation May 5, 2021

kvignesh1420 marked this pull request as ready for review May 5, 2021 08:17

kvignesh1420 force-pushed the graduate-rejection-resample branch from 6c91ddd to 1e21166 Compare May 6, 2021 14:19

gbaned requested a review from jsimsa May 6, 2021 14:55

kvignesh1420 commented May 6, 2021

jsimsa reviewed May 6, 2021

kvignesh1420 commented May 7, 2021

gbaned added the stat:awaiting response Status - Awaiting response from author label Jun 25, 2021

tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Jul 16, 2021

gbaned added the stat:awaiting response Status - Awaiting response from author label Jul 16, 2021

kvignesh1420 added 10 commits July 21, 2021 21:51

[tf.data] graduate rejection_resample API from experimental to Dataset

9ef3c95

lazy load random_ops

0d4ee23

lazy load interleave_ops and scan_ops

330a1c2

regenerate golden apis

500570d

move rejection_resample_test to kernel_tests

96d6486

update RELEASE.md

106926f

sanity fixes

65a6970

add docstring example for rejection_resample

f02349e

sanity fixes

65cc5e4

normal import random_ops

62ee3f1

import collections for doctest

7e0ace2

kvignesh1420 force-pushed the graduate-rejection-resample branch from 7996827 to 7e0ace2 Compare July 21, 2021 16:30

switch to promoted scan function

895caf1

tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Jul 27, 2021

gbaned requested a review from jsimsa July 29, 2021 15:22

google-ml-butler bot added the awaiting review Pull request awaiting review label Jul 29, 2021

jsimsa requested review from aaudiber and removed request for jsimsa July 29, 2021 15:25

aaudiber requested changes Aug 3, 2021

PR Queue automation moved this from Assigned Reviewer to Reviewer Requested Changes Aug 3, 2021

rename init_dist to initial_dist to maintain consistency

451c651

kvignesh1420 requested a review from aaudiber August 3, 2021 13:57

aaudiber approved these changes Aug 3, 2021

google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Aug 3, 2021

PR Queue automation moved this from Reviewer Requested Changes to Approved by Reviewer Aug 3, 2021

kokoro-team removed the kokoro:force-run Tests on submitted change label Aug 3, 2021

gbaned removed the awaiting review Pull request awaiting review label Aug 4, 2021

copybara-service bot merged commit 1f0dfdb into tensorflow:master Aug 6, 2021

PR Queue automation moved this from Approved by Reviewer to Merged Aug 6, 2021

google-ml-butler bot removed the ready to pull PR ready for merge process label Aug 6, 2021

kvignesh1420 deleted the graduate-rejection-resample branch August 6, 2021 23:42
