[tf.data] graduate rejection_resample API from experimental to tf.data.Dataset #48894

Merged

kvignesh1420
Member

@kvignesh1420 kvignesh1420 commented May 4, 2021

This PR graduates the tf.data.experimental.rejection_resample API into tf.data.Dataset.rejection_resample by making the following changes:

  • Adds the deprecation decorator to the experimental API.
  • Adds the rejection_resample() method to the DatasetV2 class.
  • Updates the example in the documentation to use the new API.
  • Regenerates the golden APIs.
  • Moves and updates the rejection_resample_test target from experimental/kernel_tests to kernel_tests.
  • Updates the RELEASE.md file.
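
For readers unfamiliar with the graduation pattern listed above, here is a minimal plain-Python sketch of it: the old free function keeps working but emits a deprecation warning and forwards to the new method. Everything here is hypothetical and for illustration only (`deprecated`, `ToyDataset`, `keep_if`, `experimental_keep_if`); it uses the standard `warnings` module rather than TensorFlow's internal deprecation utilities.

```python
import functools
import warnings


def deprecated(instructions):
    """Hypothetical stand-in for a deprecation decorator (stdlib only)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated. {instructions}",
                DeprecationWarning,
                stacklevel=2,
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


class ToyDataset:
    """Toy dataset with an `apply` hook and one 'graduated' method."""

    def __init__(self, items):
        self.items = list(items)

    def apply(self, transformation):
        # Old style: a transformation is a function from dataset to dataset.
        return transformation(self)

    def keep_if(self, predicate):
        # New style: the transformation is a method on the dataset itself.
        return ToyDataset(x for x in self.items if predicate(x))


@deprecated("Use `ToyDataset.keep_if(...)` instead.")
def experimental_keep_if(predicate):
    # The deprecated entry point simply forwards to the new method.
    def _apply_fn(dataset):
        return dataset.keep_if(predicate)
    return _apply_fn
```

With this shape, existing callers of `dataset.apply(experimental_keep_if(...))` keep working and see a warning, while new code calls `dataset.keep_if(...)` directly.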

TEST LOG

INFO: Build completed successfully, 8120 total actions
//tensorflow/python/data/kernel_tests:rejection_resample_test            PASSED in 5.1s
  Stats over 10 runs: max = 5.1s, min = 2.6s, avg = 3.6s, dev = 0.9s

@google-ml-butler google-ml-butler bot added the size:L CL Change Size: Large label May 4, 2021
@google-cla google-cla bot added the cla: yes label May 4, 2021
@kvignesh1420
Member Author

kvignesh1420 commented May 4, 2021

In [81]: import numpy as np
    ...: import tensorflow as tf
    ...: from collections import defaultdict
    ...:
    ...: init_dist = [0.6 , 0.4]
    ...: target_dist = [0.5, 0.5]
    ...: num_classes = len(init_dist)
    ...: num_samples = 10000
    ...: data_np = np.random.choice(num_classes, num_samples, p=init_dist)
    ...: dataset = tf.data.Dataset.from_tensor_slices(data_np)
    ...: vals = defaultdict(int)
    ...: for i in dataset:
    ...:   vals[i.numpy()]+=1
    ...: print("Initial distribution: {}".format(vals))
Initial distribution: defaultdict(<class 'int'>, {1: 4040, 0: 5960})

In [82]: resampler = tf.data.experimental.rejection_resample(
    ...:             class_func=lambda x: x,
    ...:             target_dist=target_dist,
    ...:             initial_dist=init_dist)
    ...:
    ...: dataset = dataset.apply(resampler)
    ...:
    ...: from collections import defaultdict
    ...: vals = defaultdict(int)
    ...: for i in dataset:
    ...:   vals[i[-1].numpy()]+=1
    ...: print("Resampled distribution: {}".format(vals))
Proportion of examples rejected by sampler is high: [0.6][0.6 0.4][0 1]
Proportion of examples rejected by sampler is high: [0.6][0.6 0.4][0 1]
Proportion of examples rejected by sampler is high: [0.6][0.6 0.4][0 1]
Proportion of examples rejected by sampler is high: [0.6][0.6 0.4][0 1]
Proportion of examples rejected by sampler is high: [0.6][0.6 0.4][0 1]
Proportion of examples rejected by sampler is high: [0.6][0.6 0.4][0 1]
Proportion of examples rejected by sampler is high: [0.6][0.6 0.4][0 1]
Proportion of examples rejected by sampler is high: [0.6][0.6 0.4][0 1]
Proportion of examples rejected by sampler is high: [0.6][0.6 0.4][0 1]
Proportion of examples rejected by sampler is high: [0.6][0.6 0.4][0 1]
Resampled distribution: defaultdict(<class 'int'>, {1: 8080, 0: 5960})

cc: @jsimsa I was playing around with the API and observed some weird behavior. In the above example, it seems like elements are being added to the dataset instead of being removed.
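
For context on the expectation that elements should only be removed: that is how textbook rejection resampling behaves. The sketch below is a plain-Python illustration of that textbook idea, not the tf.data implementation. The names `acceptance_probs` and `textbook_rejection_resample` are hypothetical, and the sketch assumes every entry of the initial distribution is positive.

```python
import random
from collections import Counter


def acceptance_probs(initial_dist, target_dist):
    """Per-class acceptance probability, scaled so the largest is 1.0."""
    ratios = [t / i for t, i in zip(target_dist, initial_dist)]
    top = max(ratios)
    return [r / top for r in ratios]


def textbook_rejection_resample(items, class_func, initial_dist, target_dist,
                                seed=0):
    """Keep each item with its class's acceptance probability.

    The output is a subset of the input, so it can never be larger.
    """
    rng = random.Random(seed)
    probs = acceptance_probs(initial_dist, target_dist)
    return [x for x in items if rng.random() < probs[class_func(x)]]


# Same distributions as in the example above: 60/40 resampled toward 50/50.
items = [0] * 6000 + [1] * 4000
kept = textbook_rejection_resample(
    items, class_func=lambda x: x,
    initial_dist=[0.6, 0.4], target_dist=[0.5, 0.5])
counts = Counter(kept)
print("kept", len(kept), "of", len(items), "->", dict(counts))
```

Under this scheme class 0 is accepted about two thirds of the time and class 1 always, so the two classes end up roughly balanced and the output is smaller than the input.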

@gbaned gbaned self-assigned this May 5, 2021
@gbaned gbaned added the comp:data tf.data related issues label May 5, 2021
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation May 5, 2021
@kvignesh1420 kvignesh1420 marked this pull request as ready for review May 5, 2021 08:17
@jsimsa
Contributor

jsimsa commented May 5, 2021

The dataset produced by rejection_resample internally uses sample_from_datasets to sample from a) the original dataset and b) a filtered instance of the original dataset.

In other words, it is not unexpected that the number of output elements is greater than the cardinality of the original input dataset, but it is unexpected that the distribution does not match the target distribution. This is related to an issue we have recently fixed for sample_from_datasets, where the transformation would stop respecting the sampling distribution as some of the input datasets to sample from become empty. I will send a follow-up CL that updates the rejection_resample implementation to use the new sample_from_datasets argument. This will fix the issue where the distribution of the resampled dataset fails to match the target distribution (in the case where most or all elements of the resampled dataset are consumed).
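
The effect described above can be reproduced with a small plain-Python simulation. This is only a sketch of weighted sampling from several sources, not the tf.data code; `sample_from_sources` and its `stop_on_empty` flag are hypothetical names, with the flag standing in for the new argument mentioned above.

```python
import random
from collections import Counter


def sample_from_sources(sources, weights, seed=0, stop_on_empty=False):
    """Yield items by repeatedly picking a source according to `weights`.

    If `stop_on_empty` is False, an exhausted source is dropped and sampling
    continues from the rest, so the tail of the output ignores the weights.
    If it is True, sampling ends as soon as a picked source is exhausted.
    """
    rng = random.Random(seed)
    iterators = [iter(s) for s in sources]
    active = list(range(len(sources)))
    while active:
        idx = rng.choices(active, weights=[weights[i] for i in active])[0]
        try:
            yield next(iterators[idx])
        except StopIteration:
            if stop_on_empty:
                return
            active.remove(idx)


sources = [[0] * 6000, [1] * 4000]  # class 0 is over-represented
drained = Counter(sample_from_sources(sources, [0.5, 0.5]))
stopped = Counter(sample_from_sources(sources, [0.5, 0.5], stop_on_empty=True))
print("drain everything:", dict(drained))
print("stop on empty:   ", dict(stopped))
```

When every source is drained, the output counts simply mirror the source sizes, whatever weights were requested. Stopping at the first exhausted source keeps the requested 50/50 mix at the cost of leaving some elements unconsumed.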

@yangustc07 FYI

With my fix patched, the following program (which I adapted from your example):

import collections
import tensorflow as tf

init_dist = [0.5, 0.5]
target_dist = [0.6, 0.4]
dataset = tf.data.Dataset.range(100000)
x = collections.defaultdict(int)
for i in dataset:
  x[i.numpy() % 2] += 1
print("Initial distribution: {}".format(x))

resampler = tf.data.experimental.rejection_resample(
    class_func=lambda x: x % 2,
    target_dist=target_dist,
    initial_dist=init_dist)
dataset = dataset.apply(resampler)

y = collections.defaultdict(int)
for i in dataset:
  cls, _ = i
  y[cls.numpy()] += 1
print("Resampled distribution: {}".format(y))

Produces the following output:

Initial distribution: defaultdict(<class 'int'>, {0: 50000, 1: 50000})
Proportion of examples rejected by sampler is high: [0.5][0.5 0.5][1 0]
Proportion of examples rejected by sampler is high: [0.5][0.5 0.5][1 0]
Proportion of examples rejected by sampler is high: [0.5][0.5 0.5][1 0]
Proportion of examples rejected by sampler is high: [0.5][0.5 0.5][1 0]
Proportion of examples rejected by sampler is high: [0.5][0.5 0.5][1 0]
Proportion of examples rejected by sampler is high: [0.5][0.5 0.5][1 0]
Proportion of examples rejected by sampler is high: [0.5][0.5 0.5][1 0]
Proportion of examples rejected by sampler is high: [0.5][0.5 0.5][1 0]
Proportion of examples rejected by sampler is high: [0.5][0.5 0.5][1 0]
Proportion of examples rejected by sampler is high: [0.5][0.5 0.5][1 0]
Resampled distribution: defaultdict(<class 'int'>, {0: 75089, 1: 50000})
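
As a quick arithmetic check on the counts reported in this thread, the small helper below converts class counts into fractions. It is plain Python, and `proportions` is a hypothetical name used only for this check.

```python
def proportions(counts):
    """Convert a {class: count} mapping into {class: fraction}."""
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Counts from the patched run above (target distribution [0.6, 0.4]).
patched = proportions({0: 75089, 1: 50000})

# Counts from the earlier unpatched run (target distribution [0.5, 0.5]).
unpatched = proportions({1: 8080, 0: 5960})

for name, result in [("patched", patched), ("unpatched", unpatched)]:
    print(name, {cls: round(p, 4) for cls, p in sorted(result.items())})
```

The patched counts come out at roughly 0.60 / 0.40, in line with the target, whereas the earlier run gives roughly 0.42 / 0.58 against a 0.5 / 0.5 target.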

@gbaned gbaned requested a review from jsimsa May 6, 2021 14:55
@kvignesh1420
Member Author

PR #49009 has been raised as a pre-requisite for the current one.

@gbaned
Contributor

gbaned commented Jun 25, 2021

@kvignesh1420 Can you please resolve conflicts? Thanks!

@gbaned gbaned added the stat:awaiting response Status - Awaiting response from author label Jun 25, 2021
@kvignesh1420
Member Author

@gbaned the file changes in this PR conflict with the ones in the prerequisite PRs. I will resolve all the conflicts in this PR once the prerequisites are merged. Hope that's fine.

@gbaned
Contributor

gbaned commented Jul 8, 2021

@kvignesh1420 Can you please resolve conflicts? Thanks!

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Jul 16, 2021
@gbaned gbaned added the stat:awaiting response Status - Awaiting response from author label Jul 16, 2021
@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Jul 27, 2021
@gbaned gbaned requested a review from jsimsa July 29, 2021 15:22
@google-ml-butler google-ml-butler bot added the awaiting review Pull request awaiting review label Jul 29, 2021
@jsimsa jsimsa requested review from aaudiber and removed request for jsimsa July 29, 2021 15:25
@jsimsa
Contributor

jsimsa commented Jul 29, 2021

@aaudiber could you please take a look? thanks

Contributor

@aaudiber aaudiber left a comment

Thanks @kvignesh1420

tensorflow/python/data/ops/dataset_ops.py
tensorflow/python/data/ops/dataset_ops.py
A `Dataset`
"""

target_dist_t = ops.convert_to_tensor(target_dist, name="target_dist")
Contributor

Moving all of our implementations into dataset_ops.py is making dataset_ops.py quite long and hard to navigate. In later PRs we should consider moving the dataset transformations from dataset_ops.py into their own files, similar to what we do for experimental ops. It would also make graduating experimental ops more straightforward. @jsimsa what are your thoughts?

Contributor

This is a valid concern and I think it would be a useful refactor. That said, we should do this in a manner which does not require updating callsites that import dataset_ops. One option would be that we keep dataset_ops.py as a shim to import symbols from the "per transformation" modules.

Member Author

@aaudiber @jsimsa we might also want to consider circular dependencies and file duplication during the promotion/refactor process if we are going to maintain a separate file per transformation and use those files in dataset_ops.py. Also, if we use dataset_ops.py as a shim so that the API layout is unaffected for users, we might have to move the actual functionality of the APIs currently in dataset_ops.py into a new file so that circular dependencies can be prevented. WDYT?

Contributor

+1 for making dataset_ops a shim. The refactor should have no user-facing impact. We can use LazyLoader in the dataset impl files to handle the circular reference on dataset_ops.py. Keeping dataset_ops readable is much more important than avoiding circular dependencies between dataset implementations and dataset_ops.
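
To illustrate the lazy-loading idea mentioned above, here is a minimal stdlib-only sketch of a proxy that defers an import until the first attribute access, which is what lets two modules refer to each other without importing each other at load time. `LazyModule` is a hypothetical class written for this illustration; it is not TensorFlow's LazyLoader, and `json` is used only as a convenient target module.

```python
import importlib


class LazyModule:
    """Proxy that imports the named module on first attribute access."""

    def __init__(self, module_name):
        self._module_name = module_name
        self._module = None  # nothing is imported yet

    def _load(self):
        if self._module is None:
            self._module = importlib.import_module(self._module_name)
        return self._module

    def __getattr__(self, attr):
        # Only called for attributes not found on the proxy itself,
        # so `_module_name` and `_module` never trigger a load.
        return getattr(self._load(), attr)


# The proxy can be created at module import time without importing the target.
lazy_json = LazyModule("json")
print(lazy_json.dumps([1, 2]))  # the real import happens here
```

In a per-transformation module, a proxy like this would be created at the top of the file and only resolved when a function that needs the other module actually runs.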

Member Author

okay sounds good!

PR Queue automation moved this from Assigned Reviewer to Reviewer Requested Changes Aug 3, 2021
@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Aug 3, 2021
PR Queue automation moved this from Reviewer Requested Changes to Approved by Reviewer Aug 3, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Aug 3, 2021
@gbaned gbaned removed the awaiting review Pull request awaiting review label Aug 4, 2021
@copybara-service copybara-service bot merged commit 1f0dfdb into tensorflow:master Aug 6, 2021
PR Queue automation moved this from Approved by Reviewer to Merged Aug 6, 2021
@google-ml-butler google-ml-butler bot removed the ready to pull PR ready for merge process label Aug 6, 2021
@kvignesh1420 kvignesh1420 deleted the graduate-rejection-resample branch August 6, 2021 23:42