DataSet user provided shuffled order #14518

Open
YoelShoshan opened this Issue Nov 13, 2017 · 17 comments

YoelShoshan commented Nov 13, 2017

It would be very useful if the user could provide the order of examples within a dataset (with repetitions allowed, since only indices are shuffled).
This would allow more complicated logic, such as balancing data of different types.

I assume this can somehow be supported by zipping together different Datasets, but it would be MUCH easier and more flexible if we could just pass a list of indices. It should also be lightweight, as passing one list per epoch shouldn't be a big deal.

Please tell me if this feature already exists, and if not, please add it :)

mrry commented Nov 13, 2017

Can you sketch the kind of syntax you'd like for this feature? Do you assume that the entire dataset to be shuffled fits in memory?

I'm not sure whether this should be part of the API or a "standard recipe" that we document, but that would depend on how complicated the recipe ends up being.

mrry commented Nov 13, 2017

For example, if we required the data to fit into memory, something like the following would work:

x = ...      # Tensor of shape [n, ...] containing all the feature values
y = ...      # Tensor of shape [n] containing all the labels
order = ...  # Tensor of shape [?] containing the desired order

dataset = tf.data.Dataset.from_tensor_slices(order).map(lambda i: (x[i], y[i]))
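
For concreteness, a toy, runnable version of that recipe (hypothetical values; note that order may repeat indices, which implements oversampling):

import tensorflow as tf

x = tf.constant([[0.0], [1.0], [2.0], [3.0]])      # features, shape [n, 1]
y = tf.constant([10, 11, 12, 13])                  # labels, shape [n]
order = tf.constant([2, 0, 2, 3], dtype=tf.int64)  # desired order, shape [?]

dataset = tf.data.Dataset.from_tensor_slices(order).map(lambda i: (x[i], y[i]))

next_element = dataset.make_one_shot_iterator().get_next()
with tf.Session() as sess:
    for _ in range(4):
        print(sess.run(next_element))  # yields (x[2], y[2]), (x[0], y[0]), ...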

YoelShoshan commented Nov 13, 2017

Thanks for the quick response!
I do not assume a dataset that fits entirely in memory; in fact, my main usage scenario is datasets that are far from that.

With the current API, you can shuffle your elements like this:

ds = tf.data.TFRecordDataset([filename])
ds = ds.shuffle(buffer_size)

But that's not useful for scenarios with unbalanced classes.

My thought was to add something like the following (SetExamplesOrder being the proposed new API):

ds = tf.data.TFRecordDataset([filename])
examples_order = [0, 1, 2, 4, 5, 0, 3, 2, 0, 3, 4, 2, 0, ...]  # calculated by the user as the desired order (note the repetition of examples)
ds = ds.SetExamplesOrder(examples_order)

An example of a common scenario where this is useful:
your dataset contains 100M samples of class A and 10k each of classes B, C, and D.
Let's say that while training, you want minibatches to be balanced.
So with a minibatch of size 8, you would want 2 samples of each class.

Using this syntax, you can apply whatever logic you prefer; all that needs to be passed is the list of indices (see the sketch below).
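
For concreteness, a minimal numpy sketch of the kind of user logic that would produce examples_order for this scenario (toy sizes and a hypothetical index layout stand in for the real data):

import numpy as np

rng = np.random.RandomState(0)

# Positions of each class's examples inside the single dataset
# (tiny sizes standing in for 100M / 10k / 10k / 10k).
class_indices = [np.arange(0, 100),    # class A
                 np.arange(100, 110),  # class B
                 np.arange(110, 120),  # class C
                 np.arange(120, 130)]  # class D

def make_order(num_batches, per_class=2):
    order = []
    for _ in range(num_batches):
        for idx in class_indices:
            # sample with replacement, so minority classes repeat as needed
            order.extend(rng.choice(idx, size=per_class))
    return order

examples_order = make_order(num_batches=1000)  # pass to the proposed SetExamplesOrder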

Another option is to (also?) allow the user to provide a custom shuffling Python function, which could be invoked before every new epoch if requested.

def my_foo(seed):
    # ... any user logic that returns a list of indices ...
    return examples_order

ds = tf.data.TFRecordDataset([filename])
ds = ds.CustomShuffling(my_foo)  # proposed API
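
For what it's worth, a rough approximation is possible today, under the assumption that individual examples are randomly accessible (e.g. one raw file per example; a TFRecord file cannot be indexed, which is exactly the limitation here): rebuild the order each epoch and feed it through a placeholder (file names here are hypothetical):

import tensorflow as tf

example_files = tf.constant(["ex_000.raw", "ex_001.raw", "ex_002.raw"])

order_ph = tf.placeholder(tf.int64, shape=[None])
dataset = (tf.data.Dataset.from_tensor_slices(order_ph)
           .map(lambda i: tf.read_file(example_files[i])))
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    for epoch in range(3):
        # recompute the order with arbitrary user logic before each epoch
        sess.run(iterator.initializer, feed_dict={order_ph: my_foo(seed=epoch)})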

Now, as I mentioned previously, you might say that this is achievable by creating a few Dataset instances and zipping them together; however, I feel that complicates things a bit too much, especially since it's convenient to have all the data in a single TFRecords file, and I don't see why not give users the ability to provide whatever custom index order they want.

I believe this would make the API much more useful, especially for people outside the "natural images" domain - for example, the medical domain, which is by nature highly unbalanced.

YoelShoshan commented Nov 14, 2017

Also, does Dataset.shuffle(buffer_size) require that buffer_size elements (the actual outputs of the iterator) fit into memory? Because if it does, that makes it not very useful for the scenario of a big tfrecords file.

jmaye commented Nov 16, 2017

In case you need a workaround for balancing, here is what I did:

  • put your data into folders, one folder per class
  • shuffle the folders
  • pick one of the shuffled folders, shuffle its contents, and take as many elements as you wish

You can then do things like "give me a batch of 10 classes with 5 samples per class". For instance:
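
A minimal Python sketch of this workaround (assuming a hypothetical directory layout with one subfolder per class):

import os
import random

def balanced_batch(root, num_classes=10, samples_per_class=5):
    # each subfolder of `root` holds the files of one class
    folders = [os.path.join(root, d) for d in os.listdir(root)
               if os.path.isdir(os.path.join(root, d))]
    random.shuffle(folders)
    batch = []
    for folder in folders[:num_classes]:
        files = os.listdir(folder)
        random.shuffle(files)
        batch.extend(os.path.join(folder, f) for f in files[:samples_per_class])
    return batch  # list of file paths to load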

But this should definitely be handled better in TensorFlow.

YoelShoshan commented Nov 19, 2017

Thanks @jmaye!
Actually, in my case I can't really do that (without bloating the storage), because my classes are a bit more complicated.
A class can be a combination of multiple traits; for example, a single class that I want to detect can be (shape=round AND color=red AND margin=clear).
In different experiments I may use different combinations and class definitions.
That's why I strongly prefer to work at the indices level, while keeping the data stored in a single place.

For now I went with something similar to your suggestion. I store the data in a single directory, in raw format (all inputs and outputs are numpy tensors, serialized and stored as raw files).
I create a single big filenames list, which already encodes the oversampling and, in general, the sampling proportions that I need.
I store one tf.constant() for the input list and one per output.
Then I use Dataset.from_tensor_slices() and the dataset.map() API to shuffle the elements, read the raw tensors, augment, etc.
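
Roughly, the pipeline looks like this (a sketch under some assumptions: hypothetical file names, and float32 numpy tensors written with tofile(), so tf.decode_raw can reinterpret the bytes):

import tensorflow as tf

# Oversampled filename list, precomputed in Python
# (the repetitions implement the desired sampling proportions).
filenames = tf.constant(["a_000.raw", "b_000.raw", "a_001.raw", "b_000.raw"])
labels = tf.constant([0, 1, 0, 1])

def _load(fname, label):
    raw = tf.read_file(fname)           # read the serialized numpy tensor
    x = tf.decode_raw(raw, tf.float32)  # reinterpret the bytes as float32
    return x, label

dataset = (tf.data.Dataset.from_tensor_slices((filenames, labels))
           .shuffle(buffer_size=4)
           .map(_load))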

The downsides of my approach are:

  1. I don't guarantee exact minibatch balance, only that the samples drawn over one epoch are balanced overall (so I could theoretically get extreme cases, like a minibatch containing only one class).
  2. I didn't find a TensorFlow operation to load compressed raw files (WITHOUT using tfrecords), so this is probably slower and takes more disk space than needed.

To solve #1 I can probably switch to zipping datasets.
I don't know yet how to solve #2; I could probably write a TensorFlow C++ operation for it, but currently I have too many things prioritized before that.

Thanks again for the assistance! :)

tensorflowbutler commented Dec 20, 2017

It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

1 similar comment (Jan 3, 2018)

tensorflowbutler commented Jan 18, 2018

Nagging Assignee: It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

2 similar comments (Feb 6 and Feb 20, 2018)

tensorflowbutler commented Mar 7, 2018

Nagging Assignee @mrry: It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

3 similar comments (Mar 25, Apr 8, and Apr 23, 2018)

tensorflowbutler commented May 10, 2018

Nagging Assignee @mrry: It has been 16 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@mrry mrry removed their assignment May 10, 2018

mrry commented May 10, 2018

Since I'm not working on this feature (and am still not exactly sure what it would entail), I'm going to remove my assignment and open it up to "contributions welcome".

A couple of notes though:

  • tf.contrib.data.sample_from_datasets() has recently been added, and might be useful here (as a substitute for zipping datasets); see the sketch after this list.
  • @saeta is looking at better support for random access to datasets (by maintaining an index over them) and might end up adding support that could be useful here too.
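
For reference, a minimal sketch of how sample_from_datasets() could address the balancing use case (the per-class file names are hypothetical):

import tensorflow as tf

# One dataset per class; sample_from_datasets draws from them according
# to `weights`, giving balanced minibatches in expectation.
ds_a = tf.data.TFRecordDataset(["class_a.tfrecords"]).repeat()
ds_b = tf.data.TFRecordDataset(["class_b.tfrecords"]).repeat()
balanced = tf.contrib.data.sample_from_datasets([ds_a, ds_b], weights=[0.5, 0.5])
batches = balanced.batch(10)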