DataSet user provided shuffled order #14518
It would be very useful if a user could provide the order of examples within a dataset (with repetitions allowed, since only indices are shuffled).
I assume this can somehow be supported by zipping together different Datasets, but it would be MUCH easier and more flexible if we could just pass a list of indices. It should also be lightweight, since passing one list per epoch shouldn't be a big deal.
Please tell me if this feature already exists, and if not, please add it :)
Can you sketch the kind of syntax you'd like for this feature? Do you assume that the entire dataset to be shuffled fits in memory?
I'm not sure whether this should be part of the API or a "standard recipe" that we document, but that would depend on how complicated the recipe ends up being....
For example, if we required the data to fit into memory, something like the following would work:
    x = ...  # Tensor of shape [n, ...] containing all the feature values
    y = ...  # Tensor of shape [n] containing all the labels
    order = ...  # Tensor of shape [?] containing the desired order
    dataset = tf.data.Dataset.from_tensor_slices(order).map(lambda i: (x[i], y[i]))
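If the data does not fit into memory, the same idea can be sketched with a plain Python generator that loads one example at a time in the requested order. This is only a minimal sketch: `iterate_in_order` and `load_example` are hypothetical names, and the small lists stand in for `x`, `y`, and `order` above.

```python
def iterate_in_order(load_example, order):
    """Yield examples in exactly the order given; indices may repeat."""
    for i in order:
        yield load_example(i)


# Hypothetical in-memory data standing in for x and y.
x = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]
y = [0, 1, 0]
order = [2, 0, 0, 1, 2]  # user-chosen order, with repetitions

# load_example is a simple lookup here; it could equally read from disk.
examples = list(iterate_in_order(lambda i: (x[i], y[i]), order))
```

A callable that returns such a generator could then be handed to `tf.data.Dataset.from_generator`, so the rest of the input pipeline stays unchanged.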
Thanks for the quick response!
In the current syntax, to shuffle your elements, you can do something like:
    ds = TFRecordDataset([filename])
    ds = ds.shuffle(...)
But that's not useful for scenarios with unbalanced classes.
My thought was adding something like this:
    ds = TFRecordDataset([filename])
    # Calculated by the user as the desired order (notice the repetition of examples).
    examples_order = [0, 1, 2, 4, 5, 0, 3, 2, 0, 3, 4, 2, 0, ...]
    ds = ds.SetExamplesOrder(examples_order)
An example of a common scenario where this is useful:
Using this syntax, you can apply whatever logic you prefer, and all that needs to be passed is the list of indices.
Another option is to (also?) allow the user to provide a custom Python shuffling function, which could be called before every new epoch if requested.
    def my_foo(seed):
        # ... any user logic that returns a list of indices ...
        return examples_order

    ds = TFRecordDataset([filename])
    ds = ds.CustomShuffling(my_foo)
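As an illustration of the kind of user logic such a function might contain, here is a minimal sketch that oversamples minority classes until every class appears equally often. The name `balanced_order` and the assumption that the labels are available as a plain list are mine and not part of any TensorFlow API.

```python
import random
from collections import defaultdict


def balanced_order(labels, seed):
    """Return a shuffled index list in which every class appears equally often.

    Minority classes are oversampled with replacement up to the size of the
    largest class, so indices may repeat.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    target = max(len(indices) for indices in by_class.values())
    order = []
    for indices in by_class.values():
        order.extend(indices)  # keep every original example once
        order.extend(rng.choice(indices) for _ in range(target - len(indices)))
    rng.shuffle(order)
    return order


labels = [0, 0, 0, 0, 0, 0, 1, 1]  # six examples of class 0, two of class 1
examples_order = balanced_order(labels, seed=0)
```

Passing a different seed per epoch gives a different order each time, while the same seed reproduces the same order.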
Now, as I mentioned previously, you might say that this is achievable by creating a few Dataset instances and zipping them together. However, I feel that this complicates things a bit too much, especially since it's convenient to keep all the data in a single TFRecords file, and I don't see why the user shouldn't be able to provide whatever custom index order they want.
I believe that this would make the API much more useful, especially for people outside the "natural images" domain. One example is the medical domain, which is by its nature highly unbalanced.
In case you need a workaround for balancing, here is what I did:
You can then request, for instance, a batch of 10 classes with 5 samples per class.
But this should definitely be handled better in TensorFlow.
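A batch of that shape can be sketched in plain Python as follows. This is not the workaround described above, only an assumed illustration; `class_balanced_batch` is a hypothetical helper and the labels are made up.

```python
import random
from collections import defaultdict


def class_balanced_batch(labels, num_classes, samples_per_class, rng):
    """Pick num_classes distinct classes, then samples_per_class indices from each."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    chosen_classes = rng.sample(sorted(by_class), num_classes)
    batch = []
    for label in chosen_classes:
        # Sample with replacement so that small classes can still fill their quota.
        batch.extend(rng.choices(by_class[label], k=samples_per_class))
    return batch


labels = [i % 12 for i in range(120)]  # 12 classes, 10 examples each
batch = class_balanced_batch(
    labels, num_classes=10, samples_per_class=5, rng=random.Random(0)
)
```

The returned indices can be used to look up the actual examples, for instance with an index-driven pipeline like the one discussed earlier in this thread.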
Thanks @jmaye !
For now I went with something similar to your suggestion: I'm storing the data in a single directory, in raw format (all inputs and outputs are numpy tensors, serialized and stored as raw files).
The downside of my approach is that:
To solve #1 I can probably switch to zipping datasets.
Thanks again for the assistance! :)
Since I'm not working on this feature (and am still not exactly sure what it would entail), I'm going to remove my assignment and open it up to "contributions welcome".
A couple of notes though: