[Enhancement] Redesigning TensorFlow's input pipelines #7951

Closed
mrry opened this Issue Feb 28, 2017 · 134 comments

Contributor

mrry commented Feb 28, 2017

[TL;DR: We're designing a new input pipeline API for TensorFlow, and we'd like to collect your feature requests on this issue.]

We've noticed that one of the biggest challenges in getting started with TensorFlow is how to load your own data into your programs. While TensorFlow has several methods that can be used to build complex input pipelines (such as tf.train.string_input_producer(), tf.train.batch(), etc.), they were designed for a particular use case (processing a static set of files repeatedly), and the average user experience with these methods is not great. For example:

  • Once you reach the end of a pipeline, it becomes closed and you can never use it again in the same session. This requires users to use unnatural workarounds—with control flow or multiple sessions—to get a signal after processing an entire epoch, or switch between processing two datasets (e.g. training and validation data) in the same program.
    • See #2514 and #4535 for feature requests about handling multiple epochs.
    • See #7902 and numerous Stack Overflow questions for examples of processing different datasets in the same program.
  • The current pipelines use TensorFlow queues and multiple Python threads, which can lead to poor performance (lock contention in the queues and the Python GIL) and hard-to-understand exceptions (tf.errors.OutOfRangeError).
  • The pipelines behave poorly if you forget to call tf.train.start_queue_runners(sess): in fact, they hang indefinitely and deadlock the user program.

We've decided to start from a clean slate and redesign the input pipeline API. The existing methods will remain until TF 2.0 (at least), but we are planning to add a new set of methods for loading and manipulating datasets. We're still preparing a detailed design, which we plan to share soon, but we anticipate that there will be two new APIs:

  • A Dataset represents a collection of data elements. Each element can be a tuple of one or more tensors (e.g. an image and its label). We will provide methods for creating datasets from tensors, and deriving them from another dataset (e.g. by slicing its elements, repeating its elements, shuffling its elements, batching its elements, mapping a function over its elements, etc.).
  • An Iterator can be created from a Dataset. An iterator represents the current position within a dataset, and exposes an operation (like tf.QueueBase.dequeue()) that can be run to get the next element. There will be explicit operations for initializing an iterator, so that it can be reused after you have processed all of the elements in a dataset.
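
To make the Dataset/Iterator split concrete, here is a minimal plain-Python model of the idea. It is only a sketch, not the planned TensorFlow API: the class names, `from_elements`, `make_iterator`, `initialize`, and `get_next` are hypothetical names chosen for illustration, and elements are ordinary Python tuples rather than tensors.

```python
class Dataset:
    """A re-iterable collection of elements, built lazily from a factory."""

    def __init__(self, make_iter):
        # make_iter: zero-argument callable returning a fresh Python iterator.
        self._make_iter = make_iter

    @staticmethod
    def from_elements(elements):
        elements = list(elements)
        return Dataset(lambda: iter(elements))

    def map(self, fn):
        return Dataset(lambda: (fn(x) for x in self._make_iter()))

    def repeat(self, count):
        def gen():
            for _ in range(count):
                yield from self._make_iter()
        return Dataset(gen)

    def batch(self, size):
        def gen():
            batch = []
            for x in self._make_iter():
                batch.append(x)
                if len(batch) == size:
                    yield batch
                    batch = []
            if batch:  # emit the final, possibly smaller, batch
                yield batch
        return Dataset(gen)

    def make_iterator(self):
        return Iterator(self)


class Iterator:
    """Tracks the current position in a Dataset and can be re-initialized."""

    def __init__(self, dataset):
        self._dataset = dataset
        self.initialize()

    def initialize(self):
        # Explicit (re)initialization: start again from the first element.
        self._it = self._dataset._make_iter()

    def get_next(self):
        return next(self._it)  # raises StopIteration at the end


def drain(iterator):
    """Collect elements until the iterator signals the end of the dataset."""
    out = []
    while True:
        try:
            out.append(iterator.get_next())
        except StopIteration:
            return out


elements = [(i, i % 2) for i in range(5)]  # (image, label) stand-ins
ds = Dataset.from_elements(elements).map(lambda e: (e[0] * 10, e[1])).batch(2)
it = ds.make_iterator()
first_pass = drain(it)
it.initialize()  # reuse the same iterator after reaching the end
second_pass = drain(it)
repeated = drain(Dataset.from_elements([1, 2]).repeat(2).make_iterator())
```

The point of the sketch is that reaching the end of the data is an ordinary, recoverable event: the same iterator is re-initialized and reused, which is what the queue-based pipelines described above cannot do.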

A similar pattern turns up in many different settings, including Java's Stream API, Scala's collections (and hence Spark's RDDs), and .NET's Language Integrated Query.

We're announcing this plan early because we want to collect feedback on what features you—as TensorFlow users—would like to see in an input pipeline API. What other pain points have we missed? What features do you miss from other systems? What other suggestions do you have?

We look forward to hearing from you!

kmhofmann Feb 28, 2017

Contributor

A must-have for one of our use cases is ad-hoc creation of data elements via a callback function (which creates tensors on the fly, e.g. using py_func() or through some other means).

More specifically, we currently have a use case where we employ two queues: an outer one, using a string_input_producer (with shuffling), where each string denotes/points to a "data set", and an inner one, which is produced by generating a variable number of samples from each "data set". Which and how many samples are generated differs per epoch (potentially conditional on past training behavior). Actually, we don't even use the nomenclature of an epoch anymore, since the same data is never seen twice, and the above-mentioned generation/sampling goes beyond the usual data augmentation.

Long story short: With a slightly out-of-the-ordinary use case, we've been hit by pretty much all of the problems you have mentioned above, and our workarounds have not been pretty. We'd be extremely happy to see a very flexible mechanism, where such cases are supported, and data generation doesn't have to be shoehorned into forced-upon concepts like epochs, finitely repeating queues, etc. (although they can be modeled by its primitives).
I am not sure how well the planned Dataset/Iterator API would support this.

Edit: Things we still need, of course, include multi-threaded data generation and multi-threaded random shuffle producer-consumer queues. But without the bane of the GIL -- maybe via easy C++/Eigen hooks and thread control on the native side? Back and forth, via pybind?

Edit2: The new input pipeline should also take support for variable-sized tensors (i.e. different per example) into account, for both training and inference, e.g. in a fully-convolutional setting.

mrry Feb 28, 2017

Contributor

@kmhofmann We'll certainly support tf.py_func() inside a new-style input pipeline (as well as, in general, compositions of any other TensorFlow ops). I'd like to understand more about your use case, though. How frequently do you move from one outer "data set" to the next? Are there any specific operations that you perform at the end of a "data set" or can your training loop handle the concatenation of records from different "data sets"?

We're planning to have a few nested iteration primitives, so that you can write a function mapping an element (e.g. a string representing your outer "data set") to a Dataset (representing the records in that "data set") and then flattening them down to a single Dataset. (Think SelectMany() in C#, flatMap() in Java and Scala.) So I think you could implement your logic for sampling from a "data set" in one of these flatMap() functions.
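
As a rough illustration of the flattening idea in plain Python (not the TensorFlow primitive itself), the sketch below maps each outer "data set" name to a variable number of records and flattens the result into one stream. `flat_map` and `samples_from` are hypothetical names, and the sampling rule is made up for the example.

```python
import itertools


def flat_map(fn, outer):
    """Map each outer element to an iterable and flatten the results in order."""
    return itertools.chain.from_iterable(fn(x) for x in outer)


def samples_from(name):
    # Hypothetical sampler: each outer "data set" yields a different number
    # of records; here it is simply len(name) of them.
    return (f"{name}/sample{i}" for i in range(len(name)))


records = list(flat_map(samples_from, ["ab", "xyz"]))
```

Because the inner function decides how many records each outer element produces, per-"data set" sampling logic can live entirely inside it.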

Let me know if any of this is unclear!

22csnyder Mar 1, 2017

Oh good timing! Now I can stop writing my own (horrible) dataset class. Many of the things said already resonate with my experience.

To the extent possible, I would like to code dataset-independent TensorFlow computations. I don't want to have 3 different GAN classes, each with their own create-graph and fit methods, simply because one dataset doesn't fit in memory, another is an np.array, and the third is generated on the fly.

The use case that affects me the most is [do n times: train for k iter/epoch, validate model, repeat]. There are clear problems with queues, like you said. A minor issue for which I offer no solution is that while the scheduling (how long to train before validating) is done perhaps by some method of a model class that I would like to be dataset-independent, whether it makes sense to talk in terms of iter or epoch is determined by the dataset -- ruining some of the independence.

Some other ideas I jotted down while brainstorming my own class:

  • The dataset class (and not the model) should probably be the one to have batch_size passed to it. It's awkward to ask for batch_size as a parameter during fitting and during dataset/queue creation, and ideally the compute graph doesn't have batch_size baked in.
  • A "verbose" dataset class should keep track of its own statistics. It should maintain its own counters (tensors) that keep track of iterations, samples, and epochs. In most use cases I imagine these being restored at the same time model parameters are restored.
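
A minimal sketch of these two ideas in plain Python, assuming an in-memory dataset: the dataset owns batch_size and keeps its own counters, which can be saved and restored alongside model parameters. The class and method names (`CountingDataset`, `next_batch`, `state`, `restore`) are hypothetical, and plain integers stand in for the tensor counters mentioned above.

```python
class CountingDataset:
    """Yields batches and tracks iterations, samples, and epochs seen so far."""

    def __init__(self, data, batch_size):
        self.data = list(data)
        if not self.data:
            raise ValueError("data must not be empty")
        self.batch_size = batch_size
        self.iterations = 0
        self.samples = 0
        self.epochs = 0
        self._pos = 0

    def next_batch(self):
        if self._pos >= len(self.data):
            self._pos = 0  # wrap around: a new pass over the data begins
        batch = self.data[self._pos:self._pos + self.batch_size]
        self._pos += len(batch)
        self.iterations += 1
        self.samples += len(batch)
        if self._pos >= len(self.data):
            self.epochs += 1  # a full pass has just been completed
        return batch

    def state(self):
        # Counters to checkpoint together with the model parameters.
        return {"iterations": self.iterations, "samples": self.samples,
                "epochs": self.epochs, "position": self._pos}

    def restore(self, state):
        self.iterations = state["iterations"]
        self.samples = state["samples"]
        self.epochs = state["epochs"]
        self._pos = state["position"]


ds = CountingDataset(range(5), batch_size=2)
batches = [ds.next_batch() for _ in range(4)]
saved = ds.state()

resumed = CountingDataset(range(5), batch_size=2)
resumed.restore(saved)
next_after_restore = resumed.next_batch()
```

With this shape, the training loop never needs to know the batch size, and it can schedule validation by reading `ds.iterations` or `ds.epochs`, whichever suits the dataset.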

TimZaman Mar 1, 2017

Contributor

  • Most importantly, we need to address the dequeueing overhead. I've seen dozens and dozens of cases where it (in the profiler; iirc MEMCpyWhatever) was really slow. This was mostly an issue where the GPU would get the data from the CPU.
  • It would be great if there were still a way to have an input feed that comes from multi-threaded or multi-processing Python. The following is a great and reliable hack to currently do that:
    session.run(enqueue_op, feed_dict=$some_numpy_batch_input), where you can asynchronously feed the queue from Python.
  • It would also be wow to have GPU-resident queues.

inflation Mar 1, 2017

Contributor

Good point. One thing is that currently the queue operations are "baked into" the computation graph, so it's hard to modify anything on the go. A higher-level abstraction could make this much easier, without resorting to control flow or other hacks.

ppwwyyxx Mar 2, 2017

Contributor

For a lot of my use cases, my input data either 1. is not on the file system, or 2. requires complex preprocessing unavailable in TensorFlow. In both cases the existing input pipeline cannot help, so I often use an input thread with enqueue/feed_dict plus a training thread with dequeue.

Let's assume that in most cases, you don't need to use the model itself to produce data (though sometimes that's not true). Then a solution I would really like to see is to be able to receive (similar to dequeue) tensors from a different process. (Like #4836)
The benefits are:

  1. You can use whatever tools/languages to produce data from any source, as long as the data is finally sent with a certain message protocol.
  2. It (theoretically) doesn't require an extra Python thread in the training process.
  3. If the message protocol supports pub/sub, then (1) multiple training sessions can subscribe and reuse the same input data, which is very useful when trying new models, and (2) data can be generated from different machines if the pre-processing is too heavy for a single CPU.

These are the features I really miss from a private system I've been using.
One disadvantage is that IPC/sockets have lower bandwidth than RAM, but usually that's not a bottleneck.
I know this feature may be too far away, but I hope the new design could allow for such a possible future feature.

kmhofmann Mar 2, 2017

Contributor

@mrry One "data set" can be composed of anywhere between ~500 and 30,000 dynamically generated samples. At the moment, we don't perform specific operations at the end of each data set, i.e. everything gets put into the same (large) random shuffle queue, to mix samples between data sets. But I could also imagine cases where separation of sets might be helpful.

yunjey Mar 3, 2017

Please support reading HDF5 files directly.

untom Mar 3, 2017

Personally, I'm a very big fan of the feed_dict method of feeding data into the graph. It is by far the most flexible, makes debugging way easier and makes for much simpler code. Thus my biggest wish would be to make that method more performant. Right now, this method starves my GPU all the time, which is a shame because most other DL frameworks (even those based on computational graphs) manage to do this much more efficiently. I assume there is more copying/handling going on in the background than would be necessary.

nicolasdespres Mar 8, 2017

I am glad to see this initiative. The input pipeline is definitely the steepest part of the learning curve.

I'd like:

  • A unified API to manage both in-memory (feed_dict) datasets and large ones, so that the same code scales and your model only has to talk to one API. Although I have not used it yet, I liked what I read in the input pipeline documentation.
  • More iterators! They are great. Asynchronous iterators would be even better (see PEP492). Iterators implementing __len__ are great for progress reporting.
  • multiprocessing rather than threading
  • Please no Dataset class because, IMHO, the "dataset" concept is ill-defined. The Dataset class described in the original post already exists in Python: it is a list of tuples. And what is a "dataset", anyway? A collection of train/valid/test data or simply a collection of data? Is it just a file? A directory? A generator? Is each data item an (input/target) couple? Is that always true? Is the dictionary part of the text dataset?
    The choice of the data container is driven by a lot of constraints depending on its size and the execution environment. Instead of a Dataset container, I would prefer to have a rich set of containers offering different trade-offs with respect to memory/time complexity. In addition, I would like to have a rich set of iterators, splitters, loaders, dumpers, slicers, repeaters, servers, and generators to actually work with data coming from various sources.
  • The epoch concept does not have clear semantics either. In my experience, it is best defined by epoch = global_step / steps_per_epoch and steps_per_epoch = dataset_size / batch_size.

Here is my attempt to translate some of the routines available in TF's input pipeline for large datasets to small in-memory datasets. Here are some examples of what I would like to see available in TensorFlow:

These routines demonstrate how far you can go with just simple iterators over lists of indices.
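
As a rough illustration of the idea (a hypothetical sketch, not the routines referred to above), an epoch-aware batch iterator over a shuffled list of indices could look like the following. The function name `batch_indices` and its parameters are assumptions made for the example; the data itself stays in ordinary Python lists and is only indexed.

```python
import random


def batch_indices(dataset_size, batch_size, num_epochs, seed=0):
    """Yield (epoch, indices) pairs, reshuffling the index list every epoch."""
    rng = random.Random(seed)
    indices = list(range(dataset_size))
    for epoch in range(num_epochs):
        rng.shuffle(indices)
        for start in range(0, dataset_size, batch_size):
            # Slicing copies, so later reshuffles do not alter yielded batches.
            yield epoch, indices[start:start + batch_size]


inputs = ["a", "b", "c", "d", "e"]
targets = [0, 1, 0, 1, 0]

# Index into the in-memory containers; the iterator never touches the data.
batches = [
    (epoch, [inputs[i] for i in idx], [targets[i] for i in idx])
    for epoch, idx in batch_indices(len(inputs), batch_size=2, num_epochs=2)
]
```

Since only indices are shuffled and sliced, the same iterator works for any container that supports integer indexing, and the epoch number travels with every batch.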

yaroslavvb Mar 8, 2017

Contributor

+1 to something like feed_dict. That's the only way to learn by interacting with the external world (training robot arms, Atari games, Universe).

It could be made more efficient by avoiding copies, as in PyTorch, whose Tensors share memory buffers with the underlying NumPy arrays.

MicaelCarvalho Mar 8, 2017

Contributor

I don't know TF as well as others here, so please take my comments with some skepticism:

  • With tf.py_func I was able to solve most of my input-related problems, like loading .mat files in a symbolic-ish manner. The one I'm currently struggling with is the integration of tf.train.batch with the ability to pick the source from which the input should come, so as to have train/val data in the same symbolic variable. #8168

    I understand these functions were initially designed for simple use cases, but it would be nice to have more control of the pipeline without the burden of managing everything (e.g. using tf.QueueBase.from_list but being forced to feed queues and manage threads more or less manually).

  • I'm not sure if TensorFlow optimizes the dequeue operation under the hood but, if not, I think we could greatly benefit from a parallel dequeue operation that loads data (i.e. the next batch) into GPU memory while the GPU processes the previous data (i.e. the current batch).

  • I think feed_dict-like solutions are not optimal for passing big chunks of data to the train function, like a batch of images, since they're basically a pause in the execution graph to force TF to interact with Python's runtime. An in-graph solution sounds better, with pointers to guide the graph execution, like feed_dict={is_training: True} to indicate that the input should come from the training pipeline, the model should set batchnorm and dropout (et al.) to train mode, etc. This way, TF could better optimize/parallelize the execution, and all solutions would scale.

  • The standard functions for creating batches apparently do not provide an index to indicate which batch we are processing. For example, a slice_input_producer receives the number of epochs to be generated but there seems to be no way of knowing the epoch of one sample without counting how many we have already evaluated.
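
The overlap between loading the next batch and processing the current one can be sketched in plain Python with a background thread and a bounded buffer. This is only a model of the pattern, assuming a hypothetical `prefetch` helper; it does not perform any host-to-GPU copy and says nothing about how TensorFlow implements dequeueing.

```python
import queue
import threading

_END = object()  # sentinel marking the end of the stream


def prefetch(iterable, buffer_size=1):
    """Yield items from `iterable`, producing them in a background thread."""
    buf = queue.Queue(maxsize=buffer_size)

    def producer():
        for item in iterable:
            buf.put(item)  # blocks while the buffer is full
        buf.put(_END)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is _END:
            return
        yield item


def load_batches(n):
    # Stand-in for an expensive loading/preprocessing step.
    for i in range(n):
        yield [i, i + 1]


# While the consumer works on one batch, the producer prepares the next.
results = [(index, sum(batch))
           for index, batch in enumerate(prefetch(load_batches(4)))]
```

Wrapping the stream in `enumerate` also gives each batch an explicit index, which touches on the last point above. The sketch does not handle errors raised in the producer or a consumer that stops early; a real implementation would need both.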

ErikGoldman Mar 8, 2017

right now there are two very divergent paths to getting data into TensorFlow: feed_dict and queues. queues are wonderful until you don't have a way to manipulate your data natively -- for example, if you want to load a .wav file, chop it into parts, and convert it to a spectrogram. at that point, you have to write a C++ op (doable, but a context switch + it makes a very inflexible pipeline) or pop back into Python land (slower, but very easy and flexible).

it seems like the best compromise between speed and flexibility is to create a TF queue and then make a bunch of Python threads that feed it with data. this allows you to do flexible data processing in Python (roughly parallelized on the CPU, apart from GIL issues) while maintaining some amount of speed benefit.

what if you just formalized that? the interface would be: push_data, end_of_data (for signaling the end of an epoch), and a dequeue_batch function that feeds the model. then your code could just load data in Python and stuff it onto the queue in parallel, while the model sits totally separate from all of that.
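
a minimal sketch of that interface in plain Python, using the three names from the comment (push_data, end_of_data, dequeue_batch). everything else is an assumption: the class name `PythonFedQueue`, the capacity, and the convention that a batch shorter than the requested size marks the end of an epoch. a real version would sit on top of a TF queue rather than `queue.Queue`.

```python
import queue
import threading


class PythonFedQueue:
    """Bounded queue fed from Python threads and drained in batches."""

    def __init__(self, capacity=64):
        self._q = queue.Queue(maxsize=capacity)
        self._end = object()  # private end-of-epoch marker

    def push_data(self, item):
        self._q.put(item)

    def end_of_data(self):
        self._q.put(self._end)

    def dequeue_batch(self, batch_size):
        """Return up to batch_size items; a short batch means the epoch ended."""
        batch = []
        while len(batch) < batch_size:
            item = self._q.get()
            if item is self._end:
                break
            batch.append(item)
        return batch


fed = PythonFedQueue()


def worker(start):
    for i in range(start, start + 5):
        fed.push_data(i * i)  # stand-in for Python-side preprocessing


def feed_epoch():
    workers = [threading.Thread(target=worker, args=(s,)) for s in (0, 5)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    fed.end_of_data()  # signal only after every worker has finished


threading.Thread(target=feed_epoch).start()

batches = []
while True:
    batch = fed.dequeue_batch(4)
    if batch:
        batches.append(batch)
    if len(batch) < 4:  # short (or empty) batch: end of the epoch
        break
```

the model-side loop only ever calls dequeue_batch, so it stays separate from however the data is loaded. item order across workers is not deterministic.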

jhseu Mar 9, 2017

Member

We should make feed_dict faster (likely by not copying the numpy.arrays like @yaroslavvb mentioned), but that's orthogonal to this change. No matter how much we optimize it, feed_dict will never be the fastest way to feed data into a training job.

yaroslavvb Mar 9, 2017

Contributor

feed_dict specifically may not be essential. To be more precise, we need support for pipelines where learning is done in an online fashion, and training data is generated by a system responding to actions of a TensorFlow network (learning Atari simulator, robotics simulator, robot interacting with real world, etc). This is necessary for most of the applications at OpenAI, here's one example -- https://github.com/openai/universe-starter-agent

@jhseu

jhseu Mar 9, 2017

Member

The fastest option would be to create a TensorFlow op that maintains state, takes actions as input, and generates the training data. Then add a placeholder to specify the action.

My guess is that you're looking for something that can be done completely in Python, though. There may be some mid-point between the two.

@Mazecreator

Mazecreator Mar 9, 2017

Contributor

I am not sure if this concept has been brought up yet, but I will at least put the problem in my own terms.

In dealing with RL problems and the training replay buffer, I couldn't find an easy way to use the Queues to speed up this feeding of samples through the feed_dict. Also, when randomly creating a sample set, it seemed like the samples were consumed when I wanted them left in the buffer.

What I was hoping to do is feed (possibly through feed_dict, or a file) a Queue with a new sample and, once the size of the buffer is exceeded, have the oldest sample removed from the buffer. So some concept of "sample age" would be nice. I am sure a circular buffer would work to cap the number of samples, but "age" might be of interest as well, maybe passed as part of the tuple. In the RL case, simply the order in which samples are added might cover the age (FIFO).

Again, it may have just not been clear to me how to use the queues, but being able to randomly pull a mini-batch from this sample buffer and not remove the samples so a new set of samples can be collected (possibly with prior sampled examples) would be nice.
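In plain Python, the behaviour described above (fixed capacity, oldest sample evicted first, random mini-batches that leave the buffer intact) can be sketched as follows. The ReplayBuffer class and its method names are hypothetical, and this says nothing about how it would map onto TensorFlow queues.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO buffer; sampling does not remove items."""

    def __init__(self, capacity, seed=None):
        self._items = deque(maxlen=capacity)  # deque drops the oldest item when full
        self._rng = random.Random(seed)
        self._next_seq = 0  # insertion counter, usable as a "sample age"

    def add(self, sample):
        self._items.append((self._next_seq, sample))
        self._next_seq += 1

    def sample(self, batch_size):
        """Draw (seq, sample) pairs uniformly at random, without replacement."""
        k = min(batch_size, len(self._items))
        return self._rng.sample(list(self._items), k)

    def __len__(self):
        return len(self._items)


buf = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buf.add({"obs": step})

minibatch = buf.sample(2)  # the buffer still holds 3 samples afterwards
```

Here the insertion counter doubles as the age: a larger seq means a newer sample, so no separate timestamp is needed for the FIFO case.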

@lming

lming Mar 10, 2017

I may not understand the distributed settings that the TF data input pipeline API is targeting. Is it possible to have a simple API design like PyTorch's, with only three simple classes? I can pick up PyTorch's dataset API in 5 minutes and it's good enough for all the popular academic datasets. http://pytorch.org/docs/data.html

It's great to see new efforts to solve the pain points in the TF dataset API. Looking forward to a simple/beautiful/flexible API with a minimum number of classes/concepts introduced. Thanks.

@jhseu

jhseu Mar 10, 2017

Member

@lming Yeah, the first two comments here cover that: by making a Dataset implementation that uses py_func, it'd be equivalent to the PyTorch implementation.

@taion

taion Mar 10, 2017

Contributor

I second @lming's sentiment above.

Our biggest issue with the current data loading scheme is just that it's very complicated and involves a lot of new concepts.

We don't find it spectacularly difficult to write a multithreaded data loader ourselves in Python, and generally we don't find it overly difficult to ensure that our data loading and preprocessing runs sufficiently quickly that it doesn't actually bottleneck training.

Where we're stuck is that to optimally follow recommendations, we end up in an awkward situation, one of:

  • Using feed_dict and suffering any relevant performance hits
  • Feeding from a separate thread and dealing with some one-off queue boilerplate (except this didn't speed things up at all when we tried it)
  • Reimplementing our data loading and transformation pipeline with TF primitives, perhaps with py_func, but still using the TF API for managing queue runners

The Python threading API isn't perfect, but in general when we're doing mostly non-GIL-taking tasks in NumPy or whatever, the TF queue API seems more of a burden than a help.

@hanxiao

hanxiao Jul 9, 2017

Just want to add something here: I implemented a multiprocess-based data feeding pipeline for multi-task learning. It can achieve avg. GPU utilization >90% and quad-core CPU utilization >95%. It is less prone to memory leaks and particularly good for days-long training. I'm not saying it's perfect, but it at least works much better than the current TF queue API in my case. If anyone is interested: https://hanxiao.github.io/2017/07/07/Get-10x-Speedup-in-Tensorflow-Multi-Task-Learning-using-Python-Multiprocessing/

@PatWie

PatWie Jul 10, 2017

That has been available in TensorPack for a while now, thanks to @ppwwyyxx. There you also get a further speedup using ZMQ -- plus it has a nice interface using Python generators. For me, the way tensorpack handles input data is the most elegant. I hope to see something like this in a future TF.

@hanxiao

hanxiao Jul 10, 2017

@PatWie thanks for pointing this out! I just quickly checked @ppwwyyxx's repo, and it's really awesome! Thanks again.

@xieqihui

xieqihui Jul 14, 2017

It would be great to have GPU resident queues.

@sjperkins

sjperkins Jul 14, 2017

Contributor

It would be great to have GPU resident queues.

@xieqihuiPG See StagingArea and MapStagingArea

@Kismuz

Kismuz Jul 18, 2017

Would greatly appreciate:

  1. Efficient random sampling:
    Dataset.sample_random()
  2. Dynamic updating and resizing methods:
    Dataset.update(), Dataset.pop() etc., e.g. for creating streaming buffers, replay memory objects...
  3. Meta- and descriptive statistics integration into the dataset object, plus supporting methods like Dataset.describe()
  4. Closer integration with HDF5 anyway

@jrbtaylor

jrbtaylor Jul 19, 2017

#11591 We need efficient sampling/shuffling for large datasets

@sirfz

sirfz Aug 8, 2017

What about supporting custom ops to create a Dataset? For example, let's say I have a Python function which returns a new batch on each call (a generator). I want to wrap this function using tf.py_func and use it to build a Dataset. This doesn't seem to be supported?

I currently use this method with tf.train.*batch* ops and it works nicely, but I'd like to find a way to do this for evaluation as well (and figured maybe Dataset is a good way to do this with the "reinitializable" iterator).

@eaplatanios

eaplatanios Aug 11, 2017

Contributor

@mrry This is great work and definitely very useful for creating nice learning APIs on top of TensorFlow. However, I have a couple of main concerns:

  • I cannot see a way currently to "unzip" a dataset. Let's say we have a trainable model that has both a train/fit method and an infer/predict method. Let's call the type of the (potentially) nested structure of inputs to our model I and the type of training inputs, which are only needed when training (e.g., supervision labels), TI. In this case, we want the train method to accept datasets with elements of type (I, TI) (i.e., a tuple of I and TI) and the predict method to accept datasets with elements of type I or (I, TI) (in which case it would ignore the labels). We also want the model to only have one underlying graph, supporting all these types of input. The way I could see doing that was for the underlying model to construct two iterators (one with elements of type I and one with type TI) and initialize them according to the provided datasets. However, if somebody provides a dataset with elements of type (I, TI) to the train method, there is no way to unzip this dataset and initialize both iterators. One has to use Dataset.map twice, which is not efficient (I think, but please correct me if I'm wrong) and which may also not pull matching elements from the datasets (if each pull advances the current index in the original first dataset -- I'm not sure if that happens).
  • It would be nice to support iterators over tensors defined in other languages as @sirfz mentioned. I cannot see an efficient way to do that with the current API. Please correct me if I'm wrong but currently one would have to create a new TensorDataset per batch and re-initialize an existing iterator.
    I think @fchollet may be able to comment on my first point as currently my understanding is that they are thinking of creating an entirely new graph for training only, for such cases (third step described here).

Also, if my description is terribly unclear, please let me know and I'll try to clarify.
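For reference, the "unzip" semantics being asked for look like this in plain Python iterator terms. This is only an analogy under stated assumptions: unzip is a hypothetical helper, and the sketch says nothing about how Dataset.map or TensorFlow iterators actually behave.

```python
from itertools import tee


def unzip(pairs):
    """Split an iterator of (inputs, labels) pairs into two matching iterators."""
    left, right = tee(pairs)  # tee buffers items so the source is read only once
    inputs = (x for x, _ in left)
    labels = (y for _, y in right)
    return inputs, labels


pairs = iter([("img0", 0), ("img1", 1), ("img2", 0)])
inputs, labels = unzip(pairs)

# Pulling from one side does not skip elements on the other side.
first_input = next(inputs)
first_label = next(labels)
```

The point of the buffering is that both outputs stay aligned with the single underlying source, which is exactly the property that is unclear when mapping over the same dataset twice.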

@tillahoffmann

tillahoffmann Aug 12, 2017

Contributor

The new input pipelines are great! But unfortunately, we are unable to use them for large-scale training because our data preprocessing is quite costly and needs to be distributed across multiple machines--or we just haven't figured out the right way to do it.

We have thus been using the old FIFOQueue interface in the following manner (pseudocode):

import tensorflow as tf

# Set up one queue per data split; both share capacity, dtypes, names, and shapes.
kwargs = {'capacity': ..., 'dtypes': ..., 'names': ..., 'shapes': ...}
train_queue = tf.FIFOQueue(**kwargs)
valid_queue = tf.FIFOQueue(**kwargs)
# queue_index selects the active queue: 0 = training, 1 = validation.
queue_index = tf.Variable(0, trainable=False)
queue = tf.QueueBase.from_list(queue_index, [train_queue, valid_queue])
batch_size = ...
batch = queue.dequeue_many(batch_size)

# Build model
output = build_model(batch['X'])
loss = evaluate_loss(output, batch['y'])

# Fill queues
train_data_stream = some_iterable_for_training_data()
validation_data_stream = some_iterable_for_validation_data()
start_filling_queues_in_background_thread(train_queue, train_data_stream)
start_filling_queues_in_background_thread(valid_queue, validation_data_stream)

Having two different queues with from_list allows us to switch between the training and validation queue by either setting the queue_index or feeding it in the feed_dict.

The some_iterable_for_xxx_data are usually generators that get data from a bunch of workers sitting behind a load balancer (e.g. using ZeroMQ, RabbitMQ, or PubSub). This approach works well (because the queues provide a buffer) but we don't have any way of telling when the iterator is exhausted. Some workarounds are:

  • closing the queue in the background thread such that a tf.errors.OutOfRangeError is raised when the queue is exhausted (but then we can't reopen it again #4535)
  • setting a timeout on the session.run of the training op and assuming that a timeout is due to the queue being exhausted (but the network connection might be down or our workers might be too slow)
  • counting the number of items we've processed and comparing with the expected number of items in the iterator (but that's fiddly and sometimes we don't even know how long the iterator is)
  • adding an exhausted field to the queue names and letting the background thread enqueue an item with exhausted=True together with an assertion around the dequeue operation (but using dequeue_many will dequeue elements from the next epoch if the number of items per epoch is not an integer multiple of the batch size, see also #2514)

None of these are satisfactory and it would be great to see either the ability to construct Datasets from a Python iterator with a queue for buffering, or a fix for #4535 (which would automatically fix #2514).

Looking forward to hearing whether we've just not been using the datasets API right.
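As a rough illustration of "a Python iterator with a queue for buffering" where exhaustion is visible to the consumer, here is a pure-Python sketch. The buffered function and its capacity parameter are hypothetical names, and nothing here is part of the Dataset API.

```python
import queue
import threading

_DONE = object()  # sentinel: the source iterator is exhausted


def buffered(iterable, capacity=64):
    """Yield items from `iterable`, prefetching up to `capacity` in a background thread."""
    q = queue.Queue(maxsize=capacity)

    def fill():
        try:
            for item in iterable:
                q.put(item)  # blocks while the buffer is full
        finally:
            q.put(_DONE)  # always signal the end, even if the source raised

    threading.Thread(target=fill, daemon=True).start()
    while True:
        item = q.get()
        if item is _DONE:
            return  # exhaustion surfaces as an ordinary StopIteration
        yield item


# One fresh buffered iterator per epoch, so there is no closed queue to reopen.
epochs = [list(buffered(range(5), capacity=2)) for _ in range(2)]
```

Since each epoch gets its own buffer, the end of data is an ordinary end of iteration rather than a closed queue, a timeout, or an item count. One caveat of this sketch: if the consumer stops early, the filler thread stays blocked until the process exits.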

@rasmusbergpalm

rasmusbergpalm Aug 15, 2017

I think the queues are nice enough. I'd like to see two things improved though:

An easier way of inputting data from native python other than using placeholders, and managing threads.

Maybe a class InputQueue(delegate, fn, n_filler_threads) that takes a tensorflow queue delegate and a python function fn. fn returns a (possibly nested) tuple of np.array or lists. The InputQueue starts n_filler_threads threads that call fn and put the results on the delegate. The threads are daemons, so they shut down when the main process does.

Anyway, those are just my thoughts. It's probably a lot harder than this due to the static requirements of tensorflow. Maybe you just have to provide the sizes when you create the delegate.
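A minimal sketch of that idea in plain Python, with a standard-library queue.Queue standing in for the TensorFlow queue delegate (that substitution, the get method, and the make_example function are assumptions for illustration):

```python
import queue
import threading


class InputQueue:
    """Starts daemon threads that repeatedly call `fn` and put results on `delegate`."""

    def __init__(self, delegate, fn, n_filler_threads=2):
        self.delegate = delegate
        self._fn = fn
        self._threads = [
            threading.Thread(target=self._fill, daemon=True)
            for _ in range(n_filler_threads)
        ]
        for thread in self._threads:
            thread.start()

    def _fill(self):
        while True:
            self.delegate.put(self._fn())  # blocks while the delegate is full

    def get(self):
        return self.delegate.get()


def make_example():
    # Stands in for user code returning a (possibly nested) tuple of arrays or lists.
    return ([1.0, 2.0], 0)


input_queue = InputQueue(queue.Queue(maxsize=4), make_example, n_filler_threads=2)
first_three = [input_queue.get() for _ in range(3)]
```

The bounded delegate provides the back-pressure: filler threads block once it is full, and as daemons they exit with the main process.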

shoeffner added a commit to shoeffner/ann3depth that referenced this issue Aug 17, 2017

Reverting test feature.
It is currently not well supported with MonitoredTrainingSessions.
See especially:
tensorflow/tensorflow#7951

With the previous approach training "restarted" at step 0 with the test
because the graph changed which led to conflicting summaries.
@tengerye

tengerye Aug 21, 2017

I am using the new Dataset API now, but I still have the problem of how to dynamically feed data to the Dataset. There are two similar questions here and here @albertz.

As you can see, real-world problems involve more than just feeding in a series of images or texts. So I would really appreciate it if you could let me feed the data freely in terms of when and how.

I can imagine two options. One is efficient distributed reading through feed_dict. Although it is slow, with multiprocessing it is just a matter of machines. The other is to wrap some mature and widely accepted implementation.

@suiyuan2009

suiyuan2009 Aug 21, 2017

Contributor

Use a placeholder as input to a queue and have the model read its inputs from the queue; then use a thread that calls session.run to feed inputs (maybe produced by Hadoop MapReduce) to the queue. With a staging area you can even hide all preprocessing and input time.

caisq pushed a commit to caisq/tensorflow that referenced this issue Aug 21, 2017

`Dataset.from_generator()` constructs a dataset from a Python generator.
With this change, it becomes possible to use a Python generator as the source
dataset for a `tf.contrib.data` input pipeline. This enables easier integration
with non-TensorFlow data sources. The generator can yield a nested structure of
NumPy arrays, or values convertible to NumPy arrays.

This addresses a concern raised in issue #7951.

PiperOrigin-RevId: 165663857
@bhack

bhack Aug 22, 2017

Contributor

I'm trying to test the example in the doc

But it seems that this call passes only one argument to the function:

dataset = dataset.map(_parse_function)

However, the function is defined with two parameters.

@ahundt

ahundt Aug 23, 2017

@eaplatanios one relevant PR for zip/unzip is #10837

jhseu pushed a commit to jhseu/tensorflow that referenced this issue Aug 23, 2017

Merge changes from github.
END_PUBLIC

---
Commit 575bd01 authored by Vijay Vasudevan<vrv@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Remove /replica:0 declaration in device functions and allow them
to be freely bound based on cluster names present.

When more than one value matches, it will choose the first
lexicographically available device that matches the specification,
which in practice will do pretty much the same thing as hardcoding
/replica:0.

PiperOrigin-RevId: 165766815

---
Commit d685bbc authored by Alexandre Passos<apassos@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Benchmarks with backprop enabled (and removes overhead).

Before:
np.array([[3]])                          took 1.50us (30000 iterations)
Tensor([[3]])                            took 16.30us (30000 iterations)
MatMul [2, 2]: np.dot                         took 0.61us (30000 iterations)
MatMul [2, 2]: tf.matmul                      took 60.53us (30000 iterations)
MatMul [2, 2]: gen_math_ops.mat_mul           took 25.72us (30000 iterations)
MatMul [2, 2]: TFE_Py_Execute                 took 2.82us (30000 iterations)
MatMul [2, 2]: defun(tf.matmul)               took 45.70us (30000 iterations)
MatMul [100, 784]: np.dot                         took 383.32us (1000 iterations)
MatMul [100, 784]: tf.matmul                      took 350.35us (1000 iterations)
MatMul [100, 784]: gen_math_ops.mat_mul           took 315.97us (1000 iterations)
MatMul [100, 784]: TFE_Py_Execute                 took 249.42us (1000 iterations)
MatMul [100, 784]: defun(tf.matmul)               took 280.95us (1000 iterations)

If backprop is enabled:
np.array([[3]])                          took 0.83us (30000 iterations)
Tensor([[3]])                            took 15.21us (30000 iterations)
MatMul [2, 2]: np.dot                         took 0.63us (30000 iterations)
MatMul [2, 2]: tf.matmul                      took 76.31us (30000 iterations)
MatMul [2, 2]: gen_math_ops.mat_mul           took 38.66us (30000 iterations)
MatMul [2, 2]: TFE_Py_Execute                 took 2.31us (30000 iterations)
MatMul [2, 2]: defun(tf.matmul)               took 51.96us (30000 iterations)
MatMul [100, 784]: np.dot                         took 378.34us (1000 iterations)
MatMul [100, 784]: tf.matmul                      took 352.09us (1000 iterations)
MatMul [100, 784]: gen_math_ops.mat_mul           took 364.28us (1000 iterations)
MatMul [100, 784]: TFE_Py_Execute                 took 350.68us (1000 iterations)
MatMul [100, 784]: defun(tf.matmul)               took 377.19us (1000 iterations)

After:
np.array([[3]])                          took 0.86us (30000 iterations)
Tensor([[3]])                            took 15.19us (30000 iterations)
MatMul [2, 2]: np.dot                         took 0.60us (30000 iterations)
MatMul [2, 2]: tf.matmul                      took 64.51us (30000 iterations)
MatMul [2, 2]: gen_math_ops.mat_mul           took 28.34us (30000 iterations)
MatMul [2, 2]: TFE_Py_Execute                 took 2.38us (30000 iterations)
MatMul [2, 2]: defun(tf.matmul)               took 48.50us (30000 iterations)
MatMul [100, 784]: np.dot                         took 475.27us (1000 iterations)
MatMul [100, 784]: tf.matmul                      took 399.50us (1000 iterations)
MatMul [100, 784]: gen_math_ops.mat_mul           took 307.80us (1000 iterations)
MatMul [100, 784]: TFE_Py_Execute                 took 272.83us (1000 iterations)
MatMul [100, 784]: defun(tf.matmul)               took 350.06us (1000 iterations)
PiperOrigin-RevId: 165765641

---
Commit d902bab authored by David Majnemer<majnemer@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
[XLA] Algebraic simplifier incorrectly transformed convolutions into bitcasts

PiperOrigin-RevId: 165765575

---
Commit 8e78e10 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
disable test temporarily

PiperOrigin-RevId: 165763204

---
Commit a271c37 authored by Benoit Steiner<bsteiner@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Small improvements to the arithmetic optimizer

PiperOrigin-RevId: 165760972

---
Commit b640959 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Convert some tests to cover both eager and graph.

PiperOrigin-RevId: 165760364

---
Commit 5ead764 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Reduce XLA compile time by ~7% for a convolutional image model:

* Added CompactPointerSet<T>, which is optimized for set size <= 1.
* Changed expensive CHECKs to DCHECKS in buffer_assignment.cc
* Reserve space in DFS state array before starting DFS.
* Use unsigned arithmetic in DFS state maintenance.
* HloInstruction:
  - Moved frequently used fields to start for better cache locality.
  - Use InlinedVector instead of vector for operand array.
  - Use InlinedVector instead of vector for DFS stack.
* Pre-compute "is array" and "is tuple" for LogicalBuffer.
* PointsToSet:
  - Combine two ShapeTrees into one.
  - Use CompactPointerSet instead of std::set to hold sources.
  - Use CompactPointerSet instead of std::set to hold flattened buffers.
* ShapeTree: use unique_ptr instead of optional for shape storage
  (reduces size and destruction overhead).
* Add proper const qualifiers to some FlatSet iterator methods.

Co-author=jeff
PiperOrigin-RevId: 165759117

---
Commit a0544b0 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Make TPU symbols more easily accessible from contrib.

PiperOrigin-RevId: 165753322

---
Commit cdc08af authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Slightly relax numeric tolerance for single precision tests of matrix_solve_ls (and tighten it for double precision).

PiperOrigin-RevId: 165750936

---
Commit eebcc86 authored by Jianwei Xie<xiejw@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Fixed the race condition between multi eval step increments.

PiperOrigin-RevId: 165750595

---
Commit bbc0b84 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Go: Update generated wrapper functions for TensorFlow ops.

PiperOrigin-RevId: 165748384

---
Commit 65f87c9 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Change device string in RecvNodeDescriptor in VirtualScheduler from const
reference to const as the RecvNodeDescriptor (and cached_recv_nodes map)
outlives device string from the NodeDef.

PiperOrigin-RevId: 165748244

---
Commit 57b0276 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Update ops-related pbtxt files.

PiperOrigin-RevId: 165747467

---
Commit 64e5442 authored by Derek Murray<mrry@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
[tf.contrib.data] Fix nested dictionary handling in dataset elements.

Backports recent changes to the core version of the nest.py library.

Fixes #12372.

PiperOrigin-RevId: 165746517

---
Commit 378463a authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Make tf.eye accept Python integer shapes and avoid generating unnecessary shape handling ops.
Clean up test and add tests with placeholders.

PiperOrigin-RevId: 165746090

---
Commit 109ecf8 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Add support for complex in matrix_solve_ls_op.
Split into separate files for each data type to speed up build.

PiperOrigin-RevId: 165744539

---
Commit 5144130 authored by Alexandre Passos<apassos@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Internal change.

PiperOrigin-RevId: 165737455

---
Commit d0cb32c authored by Alexandre Passos<apassos@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Docstring for ResourceVariable.

PiperOrigin-RevId: 165735441

---
Commit 32f4c5b authored by Chris Leary<leary@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
[XLA] Add IsFinite op in tf2xla.

PiperOrigin-RevId: 165734702

---
Commit 5f5c3eb authored by Mark Daoust<markdaoust@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Move "supervisor.md" from programmer's guide to api_guides.

PiperOrigin-RevId: 165732026

---
Commit d001b58 authored by Derek Murray<mrry@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
[tf.contrib.data] Fix handling of multi-output tf.py_func() in Dataset.map().

If the `map_func` returns a list of tensors, the current code will
attempt to stack it into a single tensor and raise an unintuitive
error. Some multi-output ops (such as `tf.py_func()`) return lists of
typically-not-stackable tensors. This change treats lists returned
from `map_func` as tuples; users who were relying on this
auto-stacking behavior should manually call `tf.stack()` (or
`tf.convert_to_tensor()`) on the list being returned.

Fixes #12396.

PiperOrigin-RevId: 165731970

---
Commit e6c60fb authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Fix flakiness: sometimes the op takes milliseconds to run.

PiperOrigin-RevId: 165728705

---
Commit 360bff8 authored by Ali Yahya<alive@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Makes tape.watch() work with ResourceVariables.
To this end, also adds a property, `device`, to TensorNode.

PiperOrigin-RevId: 165726368

---
Commit 80bd004 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Implements SVDF model for keyword spotting tutorial.

PiperOrigin-RevId: 165725938

---
Commit aaabf6b authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Fix bug: Using a ComputationDataHandle from the wrong ComputationBuilder.

PiperOrigin-RevId: 165724017

---
Commit 107d165 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Use 2-arg TraceMe constructor to prevent unnecessary StrCat computation when
tracing is disabled.

PiperOrigin-RevId: 165722280

---
Commit 7d01f89 authored by Pete Warden<petewarden@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Android demo app for speech recognition

PiperOrigin-RevId: 165714459

---
Commit a672932 authored by Alexandre Passos<apassos@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Deletes convert_n_to_eager_tensor. Moves convert_to_eager_tensor to constant_op.

PiperOrigin-RevId: 165704074

---
Commit 573b303 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
BUILD cleanup in tensorflow/core/kernels

PiperOrigin-RevId: 165688864

---
Commit 711be6a authored by Derek Murray<mrry@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
`Dataset.from_generator()` constructs a dataset from a Python generator.

With this change, it becomes possible to use a Python generator as the source
dataset for a `tf.contrib.data` input pipeline. This enables easier integration
with non-TensorFlow data sources. The generator can yield a nested structure of
NumPy arrays, or values convertible to NumPy arrays.

This addresses a concern raised in issue #7951.

PiperOrigin-RevId: 165663857

---
Commit 00594ec authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
New landing page and leftnav for Programmer's Guide.

PiperOrigin-RevId: 165660897

---
Commit 7359fec authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Implement Batchnorm Inference by expanding them into smaller ops.

1. Add batch norm inference support in batchnorm_rewriter
2. Connect xla's batchnorm inference to tf's FusedBatchNorm

RELNOTES: n/a
PiperOrigin-RevId: 165655351

---
Commit f0da8bf authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
[Rematerialization] Reconsider to remat operations with control dependencies

We added conservative logic to not rematerialize operations with control dependencies, since the rematerialized operations could result in undesired ordering. However, we now realize that when we remat an operation, we also copy its dependencies, which guarantees the rematerialized operation has the same constraints as the original operation.

PiperOrigin-RevId: 165654629

---
Commit a122587 authored by Chris Leary<leary@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
[XLA] Propagate error code in computation replay tool.

PiperOrigin-RevId: 165654497

---
Commit 513def0 authored by Benoit Steiner<bsteiner@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Fixed BuildOpInfoWithoutDevice

PiperOrigin-RevId: 165653933

---
Commit d7e425f authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Fix linear algebra benchmarks.

PiperOrigin-RevId: 165653891

---
Commit 465c408 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Fix the shape information propagation for Enter op.

PiperOrigin-RevId: 165653579

---
Commit c0198fd authored by Derek Murray<derek.murray@gmail.com>
Committed by gunan<gunan@google.com>:
[CMake] Add missing dependencies on boosted_trees protos and other fixes (#12315)

* [CMake] Add missing dependencies

* Avoid rebuilding boosted_trees protos for Python.

* Add GPU implementation ZeroInitializerOp to the CMake build.

---
Commit 641943f authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Update ops-related pbtxt files.

PiperOrigin-RevId: 165652758

---
Commit e313464 authored by Jonathan Hseu<jhseu@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
TPUEstimator: Fix the outfeed thread join.

PiperOrigin-RevId: 165651781

---
Commit 565a9d3 authored by Vijay Vasudevan<vrv@google.com>
Committed by Andrew Harp<andrewharp@users.noreply.github.com>:
Add missing 'type' keyword to ArgumentParser add_argument (#12275)

Fixes #12210
---
Commit 19a5572 authored by Rohan Jain<rohanj@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Allowing functions to run across devices. This change expands the ProcessFunctionLibraryRuntime library to Instantiate and Run functions on different devices. When a FunctionLibraryRuntime encounters a function with a target that is another device, it delegates Instantiate() and Run() calls to the ProcessFunctionLibraryRuntime.

This change also moves the table_ containing all function instantiations to the PFLR instead of the FunctionLibraryRuntime.

PiperOrigin-RevId: 165651194

---
Commit 8c0853d authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Add a test for negative and zero pow() input.

PiperOrigin-RevId: 165650096

---
Commit a3c4e98 authored by Pete Warden<petewarden@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Fixed input shape for freezing audio graphs

PiperOrigin-RevId: 165649546

---
Commit 9b9e598 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Add a call_logit_fn utility for logit_fn's, similar to Estimator's _call_model_fn.

PiperOrigin-RevId: 165649388

---
Commit 4ff1f44 authored by Amit Patankar<amitpatankar@google.com>
Committed by Amit Patankar<amitpatankar@google.com>:
Remove the script as well if building tf_nightly.

---
Commit 373d789 authored by Amit Patankar<amitpatankar@google.com>
Committed by Amit Patankar<amitpatankar@google.com>:
Adding the break.

---
Commit 0139ac9 authored by Amit Patankar<amitpatankar@google.com>
Committed by Amit Patankar<amitpatankar@google.com>:
Remove tensorboard as a required package if we are building tf_nightly.

---
Commit a92bd5d authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
BEGIN_PUBLIC
Automated g4 rollback of changelist 165630063

PiperOrigin-RevId: 165957821
@bhack

bhack Aug 24, 2017

Contributor

@mrry Have you tested this section of the documentation with Python 3?

@AMairesse

AMairesse Aug 25, 2017

@bhack I haven't been able to make it work when the function given to py_func returns more than one value. I'm using Python 3 and haven't tried Python 2.
Is your problem similar?

@bhack

bhack Aug 25, 2017

Contributor

@AMairesse The first problem was solved with 2139e7d

@AMairesse

AMairesse Aug 26, 2017

@bhack Thanks, I'll try that soon; I was using a workaround that I'm not proud of :-)
The documentation fix is a month old and predates the v1.3 release. Is the tensorflow.org website not updated when there is a new release? The official docs do not have the fix.

@bhack

bhack Aug 26, 2017

Contributor

@AMairesse I suggest you report this in #11786.

@llvim

llvim Aug 28, 2017

We need more operators for image processing, such as map_coordinates, so that an image augmentation pipeline can be built using only TensorFlow.

Also, Dataset does not reliably initialize variables defined in the map function; see #12648.

@vvekic

vvekic Aug 28, 2017

I'd like to re-raise an earlier performance-related question by @kratzert that seems to have fallen out of focus. The performance gain of using the new Dataset API is negligible.

@ppwwyyxx stated that queues and StagingArea can still be used with the Dataset API, but I still haven't seen a working example of this. Do we have one?

What purpose does the new API serve if one must still include queues, data_flow_ops or StagingArea complexities?

@GPhilo

GPhilo Aug 29, 2017

@vvekic, I experimented a bit with queues and the Dataset API after realising in horror that of the 0.8s/step in my inference loop, 0.2s is data fetching (with GPU at 0% utilization), rising to almost 2 seconds if the HDD is being used by something else at the same time.
My pipeline looks as follows:

  def preprocess_image(fn):
    im_s = tf.read_file(fn)
    im = tf.image.decode_jpeg(im_s, channels=3)
    im = inception_preprocessing.preprocess_for_eval(im, width=299, height=299)
    return fn, im

  dataset = tf.contrib.data.Dataset.list_files('{}/*/*.jpg'.format(FLAGS.dataset_dir))
  dataset = dataset.map(preprocess_image, num_threads=FLAGS.num_threads)
  iterator = dataset.make_one_shot_iterator()
  input_queue = tf.FIFOQueue(capacity=100*FLAGS.batch_size,
                             dtypes = iterator.output_types,
                             shapes=iterator.output_shapes)
  enqueue_sample = input_queue.enqueue(iterator.get_next())
  tf.train.add_queue_runner(tf.train.QueueRunner(input_queue, [enqueue_sample]*FLAGS.num_threads))
  
  filenames, images = input_queue.dequeue_up_to(FLAGS.batch_size)

I still have to run this on a big dataset and check if there's any performance improvement, but at least it seems to execute correctly. The catch is, I couldn't find a way to iterate over the data more than once (which luckily enough is not my use-case), because the only iterator that won't raise an error when the QueueRunners spawn the threads is the one_shot_iterator.

@tocab

tocab Aug 30, 2017

I don't know if this is the right place, but I have a question about the Dataset API. My dataset contains one column with sequences and one with sequence lengths, which I want to treat differently because I want to pad the sequences. Is it possible to address a single column in the dataset so that it is treated differently from the other column? E.g.:

two_column_dataset = ... # This contains the column sequence and sequence length
first_column_dataset = two_column_dataset[0].padded_batch(64, ...) # Pad only first column
second_column_dataset = two_column_dataset[1].batch(64) # Get corresponding sequence length for sequences
two_column_dataset = Dataset.zip((first_column_dataset, second_column_dataset))

Edit: After writing this, I figured it out:

def flat_map_func(sequence, sequence_length):
    first_column_dataset = Dataset.from_tensors(sequence).padded_batch(64, ...)
    second_column_dataset = Dataset.from_tensors(sequence_length).padded_batch(64)
    zipped_dataset = Dataset.zip((first_column_dataset, second_column_dataset))
    return zipped_dataset

two_column_dataset = two_column_dataset.flat_map(flat_map_func)
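
To make the goal of the question concrete, here is a minimal pure-Python sketch of padding one column while batching the other unchanged. It does not use TensorFlow: `padded_batch` below is a hypothetical helper written for illustration, not the `Dataset.padded_batch()` method, and the pad value of 0 and the batch size are assumptions.

```python
def padded_batch(rows, batch_size, pad_value=0):
    """Yield (padded_sequences, lengths) batches from (sequence, length) rows.

    Hypothetical helper for illustration only; not the TensorFlow API.
    """
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Pad only the sequence column, up to the longest sequence in this batch.
        max_len = max(len(seq) for seq, _ in chunk)
        sequences = [list(seq) + [pad_value] * (max_len - len(seq))
                     for seq, _ in chunk]
        # The length column is batched as-is, without padding.
        lengths = [length for _, length in chunk]
        yield sequences, lengths


# Three (sequence, sequence_length) rows, batched two at a time.
rows = [([1, 2, 3], 3), ([4], 1), ([5, 6], 2)]
batches = list(padded_batch(rows, batch_size=2))
```

Each batch pairs the padded sequences with their original lengths, so a model can later mask out the padding.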

@mrry

mrry Aug 30, 2017

Contributor

This issue thread is becoming a bit unwieldy and it's getting hard to keep track of the individual discussions, so I'm going to lock it after responding to a few of the recent comments. Please feel free to open a new issue about any specific topics or feature requests related to tf.contrib.data and we can continue the discussion there.

In response to a few recent questions:

  • @GPhilo (link) and @kratzert (link): The Dataset API includes methods for prefetching, so it shouldn't be necessary to add a queue here, and you can retain the other advantages of Datasets (like reinitialization etc.). Passing output_buffer_size=100 * FLAGS.batch_size to the dataset.map() call, and following that with dataset.batch(FLAGS.batch_size) will run your preprocess_image function in parallel and should decently increase the performance.

    dataset = tf.contrib.data.Dataset.list_files('{}/*/*.jpg'.format(FLAGS.dataset_dir))
    dataset = dataset.map(preprocess_image, num_threads=FLAGS.num_threads,
                          output_buffer_size=100*FLAGS.batch_size)
    dataset = dataset.batch(FLAGS.batch_size)
    iterator = dataset.make_one_shot_iterator()
    filenames, images = iterator.get_next()

    Note that in TF 1.4 there will be a Dataset.prefetch() method that makes it easier to add prefetching at any point in the pipeline, not just after a map(). (You can try it by downloading the current nightly build.)

    In response to @kratzert's specific question about the implementation, the Dataset and Iterator classes don't use TensorFlow's previous producer/consumer queues (such as tf.FIFOQueue or tf.RandomShuffleQueue), but they do include simpler (and more efficient) implementations of the core ideas. For example, Dataset.prefetch() will start a background thread to populate an ordered buffer that acts like a tf.FIFOQueue, so that downstream pipeline stages need not block. However, the prefetch() implementation is much simpler, because it doesn't need to support as many different concurrent operations as a tf.FIFOQueue.

  • @vvekic (link): I'd be curious to see your code before and after trying the Dataset API, and perhaps you could follow up by opening an issue describing the performance bottleneck. Compared to feeding or a (non-StagingArea) queue-based pipeline, the new API should be more efficient, and I'd be curious to know which parts aren't!

    At present, you're correct that the StagingArea functionality is not included in the Dataset API, and for peak performance in GPU workloads you will need to add a staging area manually. However, we are actively working on implementing Datasets that can span devices (see 19a5572 for some of the work in progress) and one of the first use cases for that is to support prefetching into GPU memory.

  • @tengerye (link): For dynamically feeding data into a Dataset, I'd suggest you try out the Dataset.from_generator() method that we're adding to TF 1.4 (and which is available in nightly builds already). I answered @albertz's Stack Overflow question about doing this here. (Supporting distributed pipelines will depend on the cross-device Dataset support that I mentioned in the last answer, and we'll be implementing that soon.) I think this will also work for @rasmusbergpalm's request, because you can create concurrent generators, and for @tillahoffmann's request and @sirfz's request as well. This API is very new though, so if you have any feedback, please let us know!

  • @jasonkriss (link): We've implemented something called "feedable" iterators, which let you switch the input for a single graph between multiple iterators (e.g. one for training and one for testing). The programmers' guide has more details about how to use this feature.

  • @guillaumekln (link) If you want to batch sequences with different lengths, you can use the Dataset.group_by_window() transformation. Have a look at how this is used in the NMT model code for an example.
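
The prefetching idea described above (a background thread filling an ordered buffer so that downstream stages need not block) can be sketched in plain Python. This is only an illustration under stated assumptions: `prefetch` and `_END` are hypothetical names, the code uses the standard-library `queue` and `threading` modules, and it is not how TensorFlow implements `Dataset.prefetch()`.

```python
import queue
import threading

_END = object()  # sentinel marking the end of the upstream iterable


def prefetch(iterable, buffer_size):
    """Yield items from `iterable`, produced ahead of time by a background thread.

    Hypothetical helper for illustration; not TensorFlow's implementation.
    """
    buffer = queue.Queue(maxsize=buffer_size)  # bounded, ordered (FIFO) buffer

    def producer():
        for item in iterable:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(_END)

    # Daemon thread, so an abandoned consumer does not keep the process alive.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buffer.get()  # blocks only when the buffer is empty
        if item is _END:
            return
        yield item


# The consumer sees the items in their original order.
squares = list(prefetch((i * i for i in range(5)), buffer_size=2))
```

A single producer and a single consumer keep the sketch simple. It does not propagate exceptions raised by the upstream iterable, which a real implementation would need to handle.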

Thanks again to all of you for your continued interest in this part of TensorFlow!

@tensorflow tensorflow locked and limited conversation to collaborators Aug 30, 2017

@mrry mrry closed this Aug 30, 2017
