
Oversampling functionality in dataset API #14451

Closed
kmkolasinski opened this issue Nov 10, 2017 · 6 comments
Labels
type:support Support issues

Comments

@kmkolasinski

Hello,
I would like to ask if current API of datasets allows for implementation of oversampling algorithm? I deal with highly imbalanced class problem. I was thinking that it would be nice to oversample specific classes during dataset parsing i.e. online generation. I've seen the implementation for rejection_resample function, however this removes samples instead of duplicating them and its slows down batch generation (when target distribution is much different then initial one). The thing I would like to achieve is: to take an example, look at its class probability decide if duplicate it or not. Then call dataset.shuffle(...) dataset.batch(...) and get iterator. The best (in my opinion) approach would be to oversample low probable classes and subsample most probable ones. I would like to do it online since it's more flexible. Just wondering if this is possible with current API?

@mrry
Contributor

mrry commented Nov 10, 2017

If you have a function f that returns the number of times to duplicate an element, you can data-dependently repeat an element using dataset.flat_map(lambda x: tf.data.Dataset.from_tensors(x).repeat(f(x))). Perhaps that would work?
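In plain Python terms, the flat_map-plus-repeat trick is just "yield each element f(x) times". A minimal stdlib sketch of the same idea (the elements and the duplication rule here are hypothetical, not from the thread):

```python
from itertools import chain


def repeat_each(elements, f):
    """Yield each element f(element) times, mimicking
    dataset.flat_map(lambda x: Dataset.from_tensors(x).repeat(f(x)))."""
    return chain.from_iterable([x] * f(x) for x in elements)


# Hypothetical duplication rule: repeat the rare class "b" three times.
counts = {"a": 1, "b": 3}
print(list(repeat_each(["a", "b", "a"], lambda x: counts[x])))
# -> ['a', 'b', 'b', 'b', 'a']
```

In tf.data the same thing happens lazily inside the input pipeline, so the duplicated elements are produced on the fly rather than materialized up front.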

@kmkolasinski
Author

Thanks, I will try this.

@angerson angerson added stat:awaiting response Status - Awaiting response from author type:support Support issues labels Nov 10, 2017
@kmkolasinski
Author

kmkolasinski commented Nov 10, 2017

Hi @mrry, I've checked it and it works perfectly 👍 Thank you very much :)
Here is sample code which I'm going to try in my image classification problem. I'm posting it here in case someone wants to experiment with the method. The code oversamples infrequent classes and undersamples frequent ones, where class_target_prob is just the uniform distribution in my case. I wanted to check the conclusions of the recent manuscript A systematic study of the class imbalance problem in convolutional neural networks.

Here is the code:

import tensorflow as tf

# sampling parameters
oversampling_coef = 0.9  # if equal to 0 then oversample_classes() always returns 1
undersampling_coef = 0.5  # if equal to 0 then undersampling_filter() always returns True

def oversample_classes(example):
    """
    Returns the number of copies of a given example.
    """
    class_prob = example['class_prob']
    class_target_prob = example['class_target_prob']
    prob_ratio = tf.cast(class_target_prob / class_prob, dtype=tf.float32)
    # soften the ratio: if oversampling_coef == 0 we recover the original distribution
    prob_ratio = prob_ratio ** oversampling_coef
    # for classes with probability higher than class_target_prob we
    # want to return 1
    prob_ratio = tf.maximum(prob_ratio, 1)
    # for low-probability classes this number will be very large
    repeat_count = tf.floor(prob_ratio)
    # prob_ratio can be e.g. 1.9, which means there is still a 90%
    # chance that we should return 2 instead of 1
    repeat_residual = prob_ratio - repeat_count  # a number between 0 and 1
    residual_acceptance = tf.less_equal(
        tf.random_uniform([], dtype=tf.float32), repeat_residual
    )

    residual_acceptance = tf.cast(residual_acceptance, tf.int64)
    repeat_count = tf.cast(repeat_count, dtype=tf.int64)

    return repeat_count + residual_acceptance


def undersampling_filter(example):
    """
    Computes whether a given example is rejected or not.
    """
    class_prob = example['class_prob']
    class_target_prob = example['class_target_prob']
    prob_ratio = tf.cast(class_target_prob / class_prob, dtype=tf.float32)
    prob_ratio = prob_ratio ** undersampling_coef
    prob_ratio = tf.minimum(prob_ratio, 1.0)

    acceptance = tf.less_equal(tf.random_uniform([], dtype=tf.float32), prob_ratio)

    return acceptance


dataset = dataset.flat_map(
    lambda x: tf.data.Dataset.from_tensors(x).repeat(oversample_classes(x))
)

dataset = dataset.filter(undersampling_filter)

dataset = dataset.repeat(-1)
dataset = dataset.shuffle(2048)
dataset = dataset.batch(32)

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

# initialize the model's variables (assumes the rest of the graph defines some)
sess.run(tf.global_variables_initializer())
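The floor-plus-residual step in oversample_classes above is stochastic rounding: return floor(prob_ratio), plus one extra copy with probability equal to the fractional part, so the expected number of copies equals prob_ratio exactly. A quick pure-Python check of that property (the 1.9 ratio and sample count are illustrative):

```python
import random


def stochastic_round(prob_ratio, rng):
    """Return floor(prob_ratio), plus 1 with probability equal to the
    fractional part, so that E[copies] == prob_ratio."""
    base = int(prob_ratio)        # repeat_count (prob_ratio is >= 1 here)
    residual = prob_ratio - base  # value in [0, 1)
    return base + (1 if rng.random() <= residual else 0)


rng = random.Random(0)
samples = [stochastic_round(1.9, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(round(mean, 2))  # close to 1.9
```

This is why the pipeline preserves the target class distribution in expectation even though each individual example is copied an integer number of times.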

I think this issue can be closed now. Thanks again 👍

@aselle aselle removed the stat:awaiting response Status - Awaiting response from author label Nov 10, 2017
@angerson
Contributor

angerson commented Nov 10, 2017

Since your question has been resolved, please also consider posting it as an answered question on Stack Overflow, where it's likely to reach more developers with similar questions. Thanks!

@mrry
Contributor

mrry commented Nov 10, 2017

Glad to hear that it worked! Thanks for posting the example, and +1 to @angerson's suggestion!

@kmkolasinski
Author

I've followed @angerson's suggestion and posted an answered question on Stack Overflow. Here is the link Q&A. Once more, thanks for the quick feedback.

4 participants