Feature Request: Support for None values in tf.contrib.data.Dataset #13865

Closed
nikonikolov opened this Issue Oct 20, 2017 · 5 comments

nikonikolov commented Oct 20, 2017

It would be very handy if the Dataset API supported None values. The idea is to be able to use the same Iterator object for both the training and the test datasets. Since the training dataset contains labels and the test dataset does not, the only workaround I know of at the moment is to use dummy labels to make the two datasets compatible with the same Iterator. That can waste a lot of memory, though, and is not a clean solution. Instead, maybe it could be possible to create a Dataset from None that behaves such that its output_types and output_shapes are compatible with any other type and shape, but does not consume that much memory. Here is a quick example:

# Training dataset: features zipped with labels.
X_train = tf.contrib.data.Dataset.from_tensor_slices(X_train_data)
y_train = tf.contrib.data.Dataset.from_tensor_slices(y_train_data)
data_train = tf.contrib.data.Dataset.zip((X_train, y_train))

# Test dataset: no labels available. Proposed behavior: a dataset built
# from None whose output_types and output_shapes are compatible with any
# other dataset's types and shapes.
X_test = tf.contrib.data.Dataset.from_tensor_slices(X_test_data)
y_test = tf.contrib.data.Dataset.from_tensor_slices(None)
data_test = tf.contrib.data.Dataset.zip((X_test, y_test))

assert data_train.output_types == data_test.output_types
assert data_train.output_shapes == data_test.output_shapes

# A single reinitializable iterator shared by both datasets.
iterator = tf.contrib.data.Iterator.from_structure(
    data_train.output_types, data_train.output_shapes)

train_init_op = iterator.make_initializer(data_train)
test_init_op = iterator.make_initializer(data_test)

# Build the graph ...

# Train network
with tf.Session() as sess:
  sess.run(train_init_op)
  # Train ...

# Run in prediction mode
with tf.Session() as sess:
  sess.run(test_init_op)
  # Get predictions ...

drpngx (Member) commented Oct 21, 2017

@mrry WDYT?

mrry (Contributor) commented Oct 25, 2017

Hmm, I'm concerned that using None for this purpose doesn't give enough information. For example, let's take the code fragment defining data_test:

X_test = tf.contrib.data.Dataset.from_tensor_slices(X_test_data)
y_test = tf.contrib.data.Dataset.from_tensor_slices(None)
data_test = tf.contrib.data.Dataset.zip((X_test, y_test))

At this point, what is the value of data_test.output_types[1] and data_test.output_shapes[1]? It's only implied by code later in the snippet:

iterator = tf.contrib.data.Iterator.from_structure(data_train.output_types, data_train.output_shapes)
# ...
test_init_op = iterator.make_initializer(data_test)

...and I don't see how the two assert statements would be able to pass. It seems like you want something "stronger" than an assert, which can go back a few lines in the code and cause y_test to have the appropriate type and shape.
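To make the constraint concrete (a small illustration of current behavior, not part of the proposal): a Dataset's output_types and output_shapes are determined eagerly when the dataset is constructed, so from_tensor_slices(None) would have nothing to infer from at that point:

y_train = tf.contrib.data.Dataset.from_tensor_slices(y_train_data)
print(y_train.output_types)   # known immediately, inferred from y_train_data
print(y_train.output_shapes)  # likewise fixed at construction time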

Did you have an approach in mind that would make this work?


For the record though, you don't need to waste memory to tack a dummy value onto the test dataset. For example, Dataset.from_tensors(0).repeat(len(X_test_data)) only allocates a single tensor containing 0 and returns shallow copies of it for each element. This allows it to be used with much larger datasets than would fit in a NumPy array.
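A minimal sketch of that workaround (assuming X_test_data is a NumPy array and the training labels are scalar int32, so the dummy dataset matches y_train's dtype and shape):

# One shared scalar 0, shallow-copied once per test example.
y_dummy = tf.contrib.data.Dataset.from_tensors(0).repeat(len(X_test_data))

X_test = tf.contrib.data.Dataset.from_tensor_slices(X_test_data)
data_test = tf.contrib.data.Dataset.zip((X_test, y_dummy))

# data_test now has the same output_types and output_shapes as data_train,
# so the shared reinitializable iterator will accept it.
test_init_op = iterator.make_initializer(data_test)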

nikonikolov commented Oct 30, 2017

In that case, it seems like Dataset.from_tensors(0).repeat(len(X_test_data)) can often do the job.

As far as the feature request goes, maybe something like this would be clearer:

y_test = tf.contrib.data.Dataset.empty(output_types, output_shapes)
# or
y_test = tf.contrib.data.Dataset.empty(y_train_data)

The idea is that an empty Dataset is created, which is compatible with y_train provided that the same output_types and output_shapes are supplied (or, alternatively, an example tensor from which to infer them). I'm not sure this provides any real benefit beyond code clarity. It could potentially be useful in cases where the labels are vectors or something larger than a single number (e.g. when learning distributions whose density is not concentrated in a single bin); the workaround above extends to that case too, as sketched below.
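For example, the same workaround should extend to vector labels (assuming 10-dimensional float32 labels purely for illustration; the zeros tensor is still allocated only once):

# A single all-zeros label vector, repeated for every test example.
dummy_label = tf.zeros([10], dtype=tf.float32)
y_dummy = tf.contrib.data.Dataset.from_tensors(dummy_label).repeat(len(X_test_data))
data_test = tf.contrib.data.Dataset.zip((X_test, y_dummy))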

Up to you to close the issue if you think there will be no real benefit.

tensorflowbutler (Member) commented Dec 20, 2017

It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

mrry (Contributor) commented Jan 3, 2018

Catching up on old issues: In the interests of keeping the tf.data API as compact as possible, and since we have a workaround that's not too bad, I'm going to close this out.

mrry closed this Jan 3, 2018
