Feature Request: multi-epoch alternative to tf.QueueBase.close() #2514
Comments
Why not build a separate graph with its own queue for eval? You'd also gain the flexibility to extend the eval graph for eval-specific purposes.
Thank you for the suggestion. I guess I stayed away from two graphs to avoid code duplication between the train and eval graphs and the accompanying maintenance issues, and because it would require updating the eval model weights with the trained model weights. I could define them both within one graph and share the weights, but that feels messy and still has the duplication. With a single model, I can just run …
If this isn't how others are structuring their code, then there is no urgency/necessity. I have been pre-computing the number of enqueue() calls that will be performed in an epoch and then running the consumer the appropriate number of times, with asserts that the producer is not alive and the queue is empty before starting the next epoch.
People usually use one method to create the core of the code, and a …
I think I see what you are saying. Just to be certain, do you by chance have a specific example? I looked through tensorflow/contrib/learn/python/learn but did not find an example of this on my first scan. No worries if not, thank you.
I have exactly the issue that markpwoodward does. If you have TensorFlow v0.8, you might be able to use this workaround, where by default you stream directly from the train queue, except that sometimes you pass test data in through a feed_dict. To be fair, I haven't tried it, and I came here looking for a better solution. I imagine it would look like this:
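A guess at what that workaround might look like (untested; all names, shapes, and the stand-in model below are assumptions). The idea is that any tensor in the graph can be overridden through feed_dict, including a dequeue output, so test data can be pushed in without touching the queue:

import numpy as np
import tensorflow as tf

train_queue = tf.FIFOQueue(100, [tf.float32], shapes=[[3]])
batch = train_queue.dequeue_many(4)          # default path: training data
weights = tf.Variable(tf.random_normal([3, 2]))
logits = tf.matmul(batch, weights)           # stand-in model

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Evaluation step: feed test data directly, bypassing the queue.
    test_batch = np.zeros((4, 3), dtype=np.float32)
    sess.run(logits, feed_dict={batch: test_batch})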
Picking up this old thread, a question: why don't you use tf.cond?
I ended up doing the switching between validation and train in a Python if statement.
Oops, tf.cond may be the wrong choice, as ebrevdo mentions in the parallel issue. Perhaps it could still work if what was passed to tf.cond() was the …
What can work is a cond(filter_predicate, lambda: queue.enqueue(..), lambda: tf.no_op()).
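A sketch of that conditional-enqueue idea (the filter predicate and shapes are made up). A control-dependency wrapper is used here because tf.cond branch functions are expected to return tensors rather than bare operations:

import tensorflow as tf

example = tf.random_normal([3])
keep = tf.reduce_sum(example) > 0.0          # stand-in filter predicate
queue = tf.FIFOQueue(100, [tf.float32], shapes=[[3]])

def _enqueue():
    # The enqueue only runs when this branch is taken.
    with tf.control_dependencies([queue.enqueue([example])]):
        return tf.constant(True)

maybe_enqueue = tf.cond(keep, _enqueue, lambda: tf.constant(False))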
With make_template you can build the network once and apply it to the output of each queue; the variables are shared between the two calls.
Christian, just a clarification: doesn't calling make_template twice create two operation paths?
@markpwoodward hmm, I don't know what an operation path is. Say you call it twice, with two different input tensors (e.g. coming from a dequeue op on your training/validation set respectively). What happens is that in the first call, the graph with the parameters is constructed; in the second call, this graph is simply reused. If you open up TensorBoard, you see exactly this view, a sort of bottlenecked picture: at the bottom, the two queues and their nodes, both feeding into the network expression that is made up of shared parameters, then splitting up again into two loss-connected paths (if your shared graph is followed by some additional, loss-connected ops, those do need to be duplicated, that's true). With respect to messiness, I thought it the least messy solution, as I only had to add a for loop over the queues and keep the rest the same, with make_template taking care of everything else. But messiness is probably quite subjective :-).
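A minimal sketch of the pattern described above (the layer size and queue details are placeholders). Variables are created on the first call to the template and transparently reused on the second:

import tensorflow as tf

def _network(inputs):
    return tf.layers.dense(inputs, 10)

network = tf.make_template('network', _network)

train_queue = tf.FIFOQueue(100, [tf.float32], shapes=[[8]])
val_queue = tf.FIFOQueue(100, [tf.float32], shapes=[[8]])

train_logits = network(train_queue.dequeue_many(32))   # creates variables
val_logits = network(val_queue.dequeue_many(32))       # reuses them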
@osdf, thank you for the response. Would you mind including the picture from TensorBoard?
You may be able to use QueueBase.from_list to dynamically select which queue to dequeue from; see the API documentation for details.
I don't know how I missed that! Thank you. I just tested it. It works great: two FIFOQueues, one placeholder to select the queue. This feature request got sidetracked a bit; I will leave it open, as my original request of a way to signal the last dequeue of an epoch, without needing to count dequeues, still stands. Low priority, since it is easy enough to count dequeues, and this feature is likely less relevant for larger datasets, where people don't usually do things on epoch boundaries.
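A minimal sketch of that setup (the shapes, capacities, and batch size below are assumptions): two FIFOQueues and one integer placeholder selecting which queue to dequeue from:

import tensorflow as tf

train_queue = tf.FIFOQueue(100, [tf.float32], shapes=[[3]])
test_queue = tf.FIFOQueue(100, [tf.float32], shapes=[[3]])

queue_index = tf.placeholder(tf.int32, shape=[])
selected = tf.QueueBase.from_list(queue_index, [train_queue, test_queue])
batch = selected.dequeue_many(4)

# sess.run(batch, feed_dict={queue_index: 0})   # read from the train queue
# sess.run(batch, feed_dict={queue_index: 1})   # read from the test queue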
@markpwoodward Can you explain how to use the feature pointed out by @josh11b? A code snippet would be great.
Hi, I am using the same solution that was mentioned here, having a boolean placeholder is_training that selects between the train and test queues.
Thanks. So I basically need to pass functions into tf.cond?

queue = tf.cond(is_training, train_queue, test_queue)
batch = queue.dequeue_many(batch_size)
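Roughly, yes; a sketch of what that might look like (untested; queues and sizes are stand-ins). The branches are callables returning tensors, so you select between dequeue ops rather than between the queue objects themselves. Note the earlier caveat in this thread that tf.cond around stateful queue ops can be subtle:

import tensorflow as tf

batch_size = 4
train_queue = tf.FIFOQueue(100, [tf.float32], shapes=[[3]])
test_queue = tf.FIFOQueue(100, [tf.float32], shapes=[[3]])
is_training = tf.placeholder(tf.bool, shape=[])

batch = tf.cond(is_training,
                lambda: train_queue.dequeue_many(batch_size),
                lambda: test_queue.dequeue_many(batch_size))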
The issues that I have with the two original proposals are:
What to do with batches? I.e. the single rows are handled by a buffering batch creator.
Again with batching: if the epoch size is not a multiple of the batch size, then part of the next epoch ends up in the last batch. Perhaps an alternative solution to the OP would be a queue that raises an end-of-epoch signal without needing to be closed.
One solution that I have found is to create a new queue for each epoch.
Regarding the side track, not the original request.
QueueBase.from_list may be deprecated soon. +@mrry
Many thanks to @yaroslavvb: the time taken to add the operators was just 90 seconds in my case, so the overhead is acceptable. @TimZaman: thanks a LOT for taking the time to share your approach using scopes instead of make_template.
@yaroslavvb, I'd also greatly appreciate some sample code on how to use tf.train.maybe_batch. I imagine that keep_input needs to be a placeholder passed in during graph eval. @TimZaman, I also implemented train and val with different feeds, with two different models in different name_scopes but still reusing variables across the graphs, obviously. I found that if I created the train graph first and then the val graph, evaluating the val graph still caused the train input pipeline to dequeue, implying that the graphs were not separated well enough. I see you mentioned that issue earlier in this thread. How do you ensure graph separation using different input feeds? I created a Google Groups discussion to share different TensorFlow workflows. I think it would be very helpful to me and other less experienced users if some of you could share your code designs. Thanks!
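Roughly how the shared-weights, two-pipelines setup tends to look (a sketch; the model body and inputs are stand-ins). Both towers share variables, but each must depend only on its own input pipeline; if the eval tower's graph reaches back to a training dequeue op, evaluating it will also pull from the training queue:

import tensorflow as tf

def model(inputs):
    return tf.layers.dense(inputs, 10)

train_batch = tf.random_normal([32, 8])   # stand-in for the train pipeline
val_batch = tf.random_normal([32, 8])     # stand-in for the val pipeline

with tf.variable_scope('net'):
    train_logits = model(train_batch)
with tf.variable_scope('net', reuse=True):
    val_logits = model(val_batch)          # shares the variables of 'net'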
Does anyone by now have a minimum working example using maybe_batch to switch between train and test queues? I'd greatly appreciate it!
We're planning to move away from queues and provide first-class support for multi-epoch processing in the redesigned input pipeline API. Please feel free to comment on #7951 if there are particular features that you'd like to see in the new API!
In the meantime, what is the standard approach for running multiple training epochs using a queue? I've searched around for a while, and the closest thing I could find was http://stackoverflow.com/a/39209186/212731, for which @mrry provides a workaround. Is this the standard technique we should follow for now, or ...? (Basically, I have a bunch of examples, which I'm happy to store in a file as tfrecords, but I need to run indefinitely; specifying the number of epochs at the start is not really ideal for me. Having to guess how many steps per epoch is also not ideal.)
How would I use tf.train.maybe_batch for this?
Has someone used tf.train.maybe_batch() and want to share the way they used it? It would be awesome.
In my opinion, keep_input should accept a placeholder, since according to the API documentation it is a bool Tensor. I tried something like that:
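A rough reconstruction of such an attempt (untested; the example tensor is a stand-in). One caveat: the enqueue op behind maybe_batch is run by queue-runner threads, which call session.run without a feed_dict, so a plain placeholder here can fail at runtime even though it type-checks at graph construction:

import tensorflow as tf

keep_input = tf.placeholder(tf.bool, shape=[])
example = tf.random_normal([3])            # stand-in for a reader's output
batch = tf.train.maybe_batch([example], keep_input=keep_input, batch_size=4)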
Here is a hacky solution using tf.FIFOQueue.from_list:

def select_batch(batches, index):
    """
    Select a batch based on the current value of the index. Only the active
    batch will be consumed. Each batch can be an arbitrarily nested tuple or
    list.
    """
    def _get_dtypes(tensors):
        if isinstance(tensors, (list, tuple)):
            return type(tensors)(_get_dtypes(tensor) for tensor in tensors)
        return tensors.dtype

    def _get_shapes(tensors):
        if isinstance(tensors, (list, tuple)):
            return type(tensors)(_get_shapes(tensor) for tensor in tensors)
        return tensors.shape

    def _flatten(collection):
        if isinstance(collection, (list, tuple)):
            return sum([_flatten(element) for element in collection], [])
        return [collection]

    def _unflatten(iterator, shapes):
        if isinstance(shapes, (list, tuple)):
            return type(shapes)(_unflatten(iterator, shape) for shape in shapes)
        return next(iterator)

    # One small queue per candidate batch; a QueueRunner keeps each filled.
    queues = []
    for batch in batches:
        dtypes, shapes = _get_dtypes(batch), _get_shapes(batch)
        queue = tf.FIFOQueue(10, _flatten(dtypes), _flatten(shapes))
        runner = tf.train.QueueRunner(queue, (queue.enqueue(_flatten(batch)),))
        tf.train.add_queue_runner(runner)
        queues.append(queue)
    # Dequeue only from the queue selected by the index tensor. This assumes
    # all batches share the same nested structure, since the last loop
    # iteration's shapes are reused below.
    batch = tf.FIFOQueue.from_list(index, queues).dequeue()
    return _unflatten(iter(batch), shapes)
@tpatel0409 @LucasMahieu @danijar Have you found an example of how to use tf.train.maybe_batch in the meantime? Or did you end up switching to the new Dataset API?
Personally, I decided to use the TF-Slim API just before the 1.2 release, where the Dataset API was introduced.
I don't know the best solution right now.
@LucasMahieu Thank you for your quick response. Does the tf slim api provide any way of solving the problem? I am using tf slim as well but just for using predefined layers. I am currently looking into tf.train.maybe_batch and will post here if I come up with any example.
Maybe @ebrevdo, since you suggested it? It would be great if you could post a snippet.
Also curious; I will be reading through the TF-Slim data API today and will post if I come across a solution.
…' guide. This is a potential solution to issue tensorflow#2514. PiperOrigin-RevId: 161732107
https://www.tensorflow.org/versions/r1.3/programmers_guide/datasets Looks like the Dataset API is the recommended way to do it (see, for example, the TFRecords reader). Since I am using the input pipeline, I ended up evaluating batches of test data and then feeding them back into the training queue.
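A sketch of the multi-epoch pattern with the Dataset API (tf.contrib.data in r1.3, tf.data from r1.4; the tf.data names are used below). The file names, parse function, and epoch count are assumptions:

import tensorflow as tf

def parse_fn(record):
    features = tf.parse_single_example(
        record, {'x': tf.FixedLenFeature([3], tf.float32)})
    return features['x']

train_ds = tf.data.TFRecordDataset('train.tfrecords').map(parse_fn).batch(32)
val_ds = tf.data.TFRecordDataset('val.tfrecords').map(parse_fn).batch(32)

# One reinitializable iterator; switching datasets replaces queue switching.
iterator = tf.data.Iterator.from_structure(train_ds.output_types,
                                           train_ds.output_shapes)
next_batch = iterator.get_next()
train_init = iterator.make_initializer(train_ds)
val_init = iterator.make_initializer(val_ds)

num_epochs = 10
with tf.Session() as sess:
    for _ in range(num_epochs):
        sess.run(train_init)
        while True:
            try:
                sess.run(next_batch)       # training step goes here
            except tf.errors.OutOfRangeError:
                break                      # epoch boundary; nothing to close
        sess.run(val_init)                 # run the same loop for evaluation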
With this change, it becomes possible to use a Python generator as the source dataset for a `tf.contrib.data` input pipeline. This enables easier integration with non-TensorFlow data sources. The generator can yield a nested structure of NumPy arrays, or values convertible to NumPy arrays. This addresses a concern raised in issue tensorflow#7951. PiperOrigin-RevId: 165663857
How about if the labels for train and test are different? For example, if I have two labels for train, but only one label for test.
If you are using tf.estimator.Estimator, which I think is the current best way to go (https://stackoverflow.com/questions/46925196/does-tf-estimator-estimator-train-maintain-input-fn-state), you can pass a different input_fn to train and evaluate, and you can use the mode passed to model_fn to create the appropriate graph.
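A sketch of that structure (the model and input functions below are stand-ins, not the poster's code):

import tensorflow as tf

def model_fn(features, labels, mode):
    logits = tf.layers.dense(features, 2)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    if mode == tf.estimator.ModeKeys.TRAIN:
        train_op = tf.train.AdamOptimizer().minimize(
            loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
    return tf.estimator.EstimatorSpec(mode, loss=loss)

def train_input_fn():   # each train() call rebuilds a fresh one-epoch pipeline
    data = (tf.random_normal([100, 8]), tf.zeros([100], dtype=tf.int64))
    return tf.data.Dataset.from_tensor_slices(data).batch(10)

def eval_input_fn():
    data = (tf.random_normal([20, 8]), tf.zeros([20], dtype=tf.int64))
    return tf.data.Dataset.from_tensor_slices(data).batch(10)

estimator = tf.estimator.Estimator(model_fn)
for _ in range(5):                 # alternate train and evaluate per "epoch"
    estimator.train(train_input_fn)
    estimator.evaluate(eval_input_fn)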
I just closed this; feel free to re-open it. In my opinion, if you need to do things at the end of each epoch, then just run estimator.train() one epoch at a time. Often the dataset is too big to wait for epoch boundaries, so use a listener that runs evaluation whenever a checkpoint is saved. I initially had a problem with this approach because of the perceived overhead of creating the graph on each call to estimator.train().
@TimZaman Is there an example (using make_template, or sharing weights by scope) of doing such things (different feeds for train and test)? I tried the shared-weights method, but the end result is that the testing data always uses the initial weights, even though the weights for training have been updated.
Can anyone help me?
@markpwoodward Many thanks for your update and the Stack Overflow link; I would like to migrate to the Estimator API as well. I still have a challenge; can you share your thoughts, please? I would like to run evaluation before my epoch ends (as stated in your post as well, because the dataset is huge). Earlier, I used to run evaluation and then save a checkpoint if the score was better than the old one. Is this logical? From your post I understand that evaluation (by a listener) is run whenever a checkpoint is saved.
@rsethur Saving only the best checkpoint would be efficient, but I am not sure how to do that with this setup. Also, evaluation loss may not be the thing you want to optimize; you might want to review a number of evaluation metrics and pick the checkpoint that looks best in a general sense. I just keep all checkpoints and visually inspect whether I need to train more. As for setting the frequency at which estimator.train() saves checkpoints, there are probably other ways to do this, but I do it in a RunConfig object passed to Estimator's constructor.
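A sketch of that configuration (the step count is made up; model_fn is assumed defined as in the earlier sketch). The checkpoint frequency in turn controls how often a checkpoint-triggered evaluation listener fires:

import tensorflow as tf

run_config = tf.estimator.RunConfig(save_checkpoints_steps=1000,
                                    keep_checkpoint_max=None)  # keep all
estimator = tf.estimator.Estimator(model_fn, config=run_config)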
Also, take a look at the new Estimator.train_and_evaluate(); you may prefer it. I still prefer my proposed approach, as I actually run evaluation on a fixed subset of my training data in addition to running evaluation on my validation set. I haven't been able to get train_and_evaluate() to support this (e.g. multiple EvalSpecs).
@markpwoodward Many thanks for your detailed response - much appreciated!
Examples on the web demonstrate signaling the end of queue data by calling queue.close() in the data producer and then catching the tf.errors.OutOfRangeError exception in the data consumer.
This works fine for a single epoch, but I do multiple epochs, alternating between training data and testing data, and I can't reuse the queue after calling queue.close().
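For reference, a minimal sketch of that single-epoch pattern (sizes and dtypes are made up):

import tensorflow as tf

queue = tf.FIFOQueue(100, [tf.float32])
enqueue = queue.enqueue([tf.random_normal([])])
close = queue.close()
dequeue = queue.dequeue()

with tf.Session() as sess:
    for _ in range(10):            # producer: one epoch of data
        sess.run(enqueue)
    sess.run(close)                # signal that no more data is coming
    while True:                    # consumer
        try:
            sess.run(dequeue)
        except tf.errors.OutOfRangeError:
            break                  # queue is closed and empty; cannot reuse it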
The two solutions that I have thought of using the existing code are: …
Both seem a little hacky.
Multi-epoch use of queues might be simplified by adding one of the following: …
example usage of 1): …