
Oversampling functionality in dataset API #14451

Closed
kmkolasinski opened this issue Nov 10, 2017 · 6 comments
Labels
type:support Support issues

Comments

@kmkolasinski

Hello,
I would like to ask if current API of datasets allows for implementation of oversampling algorithm? I deal with highly imbalanced class problem. I was thinking that it would be nice to oversample specific classes during dataset parsing i.e. online generation. I've seen the implementation for rejection_resample function, however this removes samples instead of duplicating them and its slows down batch generation (when target distribution is much different then initial one). The thing I would like to achieve is: to take an example, look at its class probability decide if duplicate it or not. Then call dataset.shuffle(...) dataset.batch(...) and get iterator. The best (in my opinion) approach would be to oversample low probable classes and subsample most probable ones. I would like to do it online since it's more flexible. Just wondering if this is possible with current API?

@mrry
Contributor

mrry commented Nov 10, 2017

If you have a function f that returns the number of times to duplicate an element, you can data-dependently repeat an element using dataset.flat_map(lambda x: tf.data.Dataset.from_tensors(x).repeat(f(x))). Perhaps that would work?
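In plain Python terms, the flat_map-plus-repeat trick is just "yield each element f(x) times". A minimal stdlib sketch of the same idea (the elements and the duplication rule here are hypothetical, not from the thread):

```python
from itertools import chain


def repeat_each(elements, f):
    """Yield each element f(element) times, mimicking
    dataset.flat_map(lambda x: Dataset.from_tensors(x).repeat(f(x)))."""
    return chain.from_iterable([x] * f(x) for x in elements)


# Hypothetical duplication rule: repeat the rare class "b" three times.
counts = {"a": 1, "b": 3}
print(list(repeat_each(["a", "b", "a"], lambda x: counts[x])))
# -> ['a', 'b', 'b', 'b', 'a']
```

In tf.data the same thing happens lazily inside the input pipeline, so the duplicated elements are produced on the fly rather than materialized up front.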

@kmkolasinski
Author

Thanks, I will try this.

@angerson angerson added stat:awaiting response Status - Awaiting response from author type:support Support issues labels Nov 10, 2017
@kmkolasinski
Author

kmkolasinski commented Nov 10, 2017

Hi @mrry, I've checked it and it works perfectly 👍 Thank you very much :)
Here is sample code which I'm going to try in my image classification problem. I'm posting it here in case someone wants to experiment with the method. The code oversamples infrequent classes and undersamples frequent ones, where class_target_prob is just the uniform distribution in my case. I wanted to check the conclusions of the recent manuscript A systematic study of the class imbalance problem in convolutional neural networks.

Here is the code:

import tensorflow as tf

# sampling parameters
oversampling_coef = 0.9  # if equal to 0 then oversample_classes() always returns 1
undersampling_coef = 0.5  # if equal to 0 then undersampling_filter() always returns True

def oversample_classes(example):
    """
    Returns the number of copies of a given example.
    """
    class_prob = example['class_prob']
    class_target_prob = example['class_target_prob']
    prob_ratio = tf.cast(class_target_prob / class_prob, dtype=tf.float32)
    # soften the ratio: if oversampling_coef == 0 we recover the original distribution
    prob_ratio = prob_ratio ** oversampling_coef
    # for classes with probability higher than class_target_prob we
    # want to return 1
    prob_ratio = tf.maximum(prob_ratio, 1)
    # for low-probability classes this number will be very large
    repeat_count = tf.floor(prob_ratio)
    # prob_ratio can be e.g. 1.9, which means there is still a 90%
    # chance that we should return 2 instead of 1
    repeat_residual = prob_ratio - repeat_count  # a number between 0 and 1
    residual_acceptance = tf.less_equal(
        tf.random_uniform([], dtype=tf.float32), repeat_residual
    )

    residual_acceptance = tf.cast(residual_acceptance, tf.int64)
    repeat_count = tf.cast(repeat_count, dtype=tf.int64)

    return repeat_count + residual_acceptance


def undersampling_filter(example):
    """
    Computes whether a given example is rejected or not.
    """
    class_prob = example['class_prob']
    class_target_prob = example['class_target_prob']
    prob_ratio = tf.cast(class_target_prob / class_prob, dtype=tf.float32)
    prob_ratio = prob_ratio ** undersampling_coef
    prob_ratio = tf.minimum(prob_ratio, 1.0)

    acceptance = tf.less_equal(tf.random_uniform([], dtype=tf.float32), prob_ratio)

    return acceptance


dataset = dataset.flat_map(
    lambda x: tf.data.Dataset.from_tensors(x).repeat(oversample_classes(x))
)

dataset = dataset.filter(undersampling_filter)

dataset = dataset.repeat(-1)
dataset = dataset.shuffle(2048)
dataset = dataset.batch(32)

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

# initialize the model's variables (assumes the rest of the graph defines some)
sess.run(tf.global_variables_initializer())
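The floor-plus-residual step in oversample_classes above is stochastic rounding: return floor(prob_ratio), plus one extra copy with probability equal to the fractional part, so the expected number of copies equals prob_ratio exactly. A quick pure-Python check of that property (the 1.9 ratio and sample count are illustrative):

```python
import random


def stochastic_round(prob_ratio, rng):
    """Return floor(prob_ratio), plus 1 with probability equal to the
    fractional part, so that E[copies] == prob_ratio."""
    base = int(prob_ratio)        # repeat_count (prob_ratio is >= 1 here)
    residual = prob_ratio - base  # value in [0, 1)
    return base + (1 if rng.random() <= residual else 0)


rng = random.Random(0)
samples = [stochastic_round(1.9, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(round(mean, 2))  # close to 1.9
```

This is why the pipeline preserves the target class distribution in expectation even though each individual example is copied an integer number of times.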

I think this issue can be closed now. Thanks again 👍

@aselle aselle removed the stat:awaiting response Status - Awaiting response from author label Nov 10, 2017
@angerson
Contributor

angerson commented Nov 10, 2017

Since your question has been resolved, please also consider posting it as an answered question on Stack Overflow, where it's likely to reach more developers with similar questions. Thanks!

@mrry
Contributor

mrry commented Nov 10, 2017

Glad to hear that it worked! Thanks for posting the example, and +1 to @angerson's suggestion!

@kmkolasinski
Author

I've followed @angerson's suggestion and posted an answered question on Stack Overflow. Here is the link Q&A. Once more, thanks for the quick feedback.

4 participants