tf Dataset / Iterator console flood when using CUDA builds #12414
Comments
This is a side effect of using tf.contrib.data iterators with small batches (as you are doing for inference). @mrry, is there any way to disable the logging in a configurable way? |
(nothing to do with dynamic_rnn). |
Hi @ebrevdo, by small batches do you refer to the size of most batches or the size of the last one? I thought it had something to do with dynamic_decode, because simply iterating through the dataset and printing it to the console, for example, works fine. |
How is the dataset defined? I think there's an issue in the 1.3 release when you have a |
_parse_function is a Python function; does tf.py_func get called internally? |
What's the source of |
_parse_function takes an Example and an int, applies tf.parse_example, returns a tuple of one tf.float32 and three tf.int64 (the context and the features) |
Do you get the logged message at every run call or only every once in a while?
|
I get it at the end of each training epoch. One workaround that I could find is stopping the loop after the last batch, if I know their number in advance (instead of using the infinite loop and stopping on the exception). Here is the content of my
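Not the snippet referenced above, just a generic illustration of the two looping patterns being compared; sess, iterator, train_op, and num_batches are placeholder names:

# Pattern 1: run until the iterator is exhausted and catch the exception;
# at the end of the epoch the runtime also logs the OutOfRange status.
sess.run(iterator.initializer)
while True:
    try:
        sess.run(train_op)
    except tf.errors.OutOfRangeError:
        break  # end of epoch

# Pattern 2: if the number of batches per epoch is known in advance,
# stop before the iterator runs dry, so no OutOfRange status is ever raised.
sess.run(iterator.initializer)
for _ in range(num_batches):
    sess.run(train_op)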
|
I can confirm the same issue on a couple of different Ubuntu versions (and some unknown Linux distro), Python 3.5 and 3.6, TensorFlow 1.2 and 1.3, cudnn 5.1 and cudnn 6, and on Tesla P100, GeForce GTX 1070 and Pascal Titan X. I get the message at the end of each epoch, on both the training and dev set. Here's my dataset definition:

def setup_dataset(filename: str, batch_size: int, shuffle_seed=None):
    n_recs, labels = tfrecord_metadata(filename)
    dataset = tfdata.TFRecordDataset(filename)
    dataset = dataset.skip(1)  # I store some metadata in the first record
    dataset = dataset.map(parse_label_sequence_example if label_seq
                          else parse_example)
    if shuffle_seed is not None:
        dataset = dataset.shuffle(n_recs, shuffle_seed)
    dataset = dataset.map(lambda x, y: (x, y, build_mask(y)))
    padded_shapes = (tf.TensorShape([None]), tf.TensorShape([None]),
                     tf.TensorShape([None]))
    dataset = dataset.padded_batch(batch_size, padded_shapes=padded_shapes)
    return dataset, labels

I use a batch size of 8, and shuffle the training set but not the dev set (but the warning shows up in all cases). |
What happens to your data once it is fetched and parsed by the iterator? |
How much do you need? Here's how it starts:

shuf_seed = tf.placeholder(tf.int64, shape=[])
trn_data, n_cat = setup_dataset(args.train, args.batch_size, shuf_seed)
dev_data, _ = setup_dataset(args.dev, args.batch_size)

it = tfdata.Iterator.from_structure(trn_data.output_types,
                                    trn_data.output_shapes)
x, y, y_mask = it.get_next()
y = tf.reshape(y, [-1])
y_mask = tf.cast(tf.reshape(y_mask, [-1]), tf.float32)

temporal_padding = args.filter_size[0] - 1
t_pad_before = temporal_padding // 2
t_pad_after = temporal_padding - t_pad_before
x = tf.pad(x, [[0, 0], [t_pad_before, t_pad_after]])

trn_init = it.make_initializer(trn_data)
dev_init = it.make_initializer(dev_data)

<snip>

emb_layer = tf.Variable(embeddings, trainable=args.trainable_embeddings,
                        name='embedding_matrix')
x_embedded = tf.nn.embedding_lookup(emb_layer, x)
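For context, a rough sketch of how initializers like trn_init and dev_init above are typically driven in a session loop; loss_op and num_epochs are placeholder names, the rest comes from the snippet:

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(num_epochs):
        # Point the shared iterator at the training set; the shuffle seed
        # placeholder defined above gets a fresh value every epoch.
        sess.run(trn_init, feed_dict={shuf_seed: epoch})
        while True:
            try:
                sess.run(loss_op)
            except tf.errors.OutOfRangeError:
                break  # end of the training pass; this is where the flood appears
        # Same pattern for the dev set.
        sess.run(dev_init)
        while True:
            try:
                sess.run(loss_op)
            except tf.errors.OutOfRangeError:
                break
|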
Never mind, I made a wrong assumption. Looking at the first comment of @mrry, the issue might be fixed by now in tf nightly 1.4. |
I'd be willing to test it out---I'm not sure what the git model is that's used for TF, but if I compile from master and test that, will that suffice? |
Maybe try `pip install tf-nightly` in a python 3.5 environment first
|
This issue is specific to the GPU builds, and the TF README.md says the nightlies are CPU builds only---has that changed? |
I might not be the right person to answer this, but you are correct in my opinion; my mistake. Please try compiling from sources.
|
I just cloned master yesterday and compiled that and gave it a shot, but the error still shows up for me with that latest build. Not sure how it was for @georgesterpu but I actually get multiple copies of the message:
On my dev set, which is smaller, I just get one:
|
I met the same problem and fixed it with the following. |
@fumihwh That won't work for me, as I need to know when the epoch is done. |
@rightaditya how about this with
|
I have just cloned the master repo and compiled from sources. This time with cuDNN 7 :) |
@fumihwh Is your dataset size divisible by your batch size? Mine isn't, so I'd probably have to
Still, it's not an ideal solution given the way the API seems to have been designed.
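For reference, when the total record count is known (or can be counted once up front), the number of batches per epoch can be computed directly; the file name and batch_size below are placeholders, not taken from this thread:

import tensorflow as tf

def count_records(tfrecord_path):
    # One-off linear pass over the file; fine for moderate dataset sizes.
    return sum(1 for _ in tf.python_io.tf_record_iterator(tfrecord_path))

batch_size = 8  # example value
num_records = count_records('train.tfrecord')
# Ceiling division, so the smaller final batch is counted as one step.
steps_per_epoch = (num_records + batch_size - 1) // batch_size
|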
@fumihwh Is the
|
I encountered the same error as described by previous users during testing, but not in training. During evaluation, there are thousands of "Out of range: End of sequence" errors (I guess the number of errors is the same as the number of my evaluation samples). But correct evaluation results are still printed out after those errors, and the program does not crash and can still continue training. Does anyone know the reason and how to fix it? Thank you. I used an Estimator, and the input functions are
My model function is
|
I get flooded the same way using today's tf-nightly-gpu wheel binary on Ubuntu 16.04, Python 2.7, cudnn 6. |
I get the same warning with tensorflow 1.3 |
I am experiencing this with tensorflow-gpu on Windows 10, Python 3.6.1, cudnn64_6, using tf.slim. Still need to verify this, but it seems the message only comes up when training for the first time, without a checkpoint. The dataset size is divisible by the batch size. |
Same problem with TF1.3 |
Same problem. I got a bunch of error logs at the end of each epoch. Very annoying. "2017-10-31 18:59:22.473230: W tensorflow/core/framework/op_kernel.cc:1192] Out of range: End of sequence [[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,120,120,3], [?,120,120,1]], output_types=[DT_FLOAT, DT_UINT8], _device="/job:localhost/replica:0/task:0/cpu:0"]]" |
Same problem with TF 1.4 |
I think I can guess what causes this problem, but I do not know how to resolve it.
2017-11-04 10:47:47.713841: W tensorflow/core/framework/op_kernel.cc:1192] Out of range: End of sequence
Maybe we should run the training process with multiple threads, so that each training thread only consumes the data returned from just one dataset thread. |
Maybe there are some other causes, because setting the num_parallel_calls argument to 1 in tf.data.Dataset.map doesn't prevent the warning from flooding. BTW, a very useless tip to avoid the warning is to use for-loops instead of try-except (and remember to run the initializer every time before the loop). However, this only works when the dataset is simple enough that you know the exact number of iterations per epoch... |
Yes, you can wrap the dataset with a repeat() call and use an outer for loop for each epoch if you know the number of elements per epoch.
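A minimal sketch of that pattern; sess, dataset, train_op, steps_per_epoch, and num_epochs are placeholder names, not from this thread:

dataset = dataset.repeat()  # repeat indefinitely, so get_next() never raises OutOfRange
iterator = dataset.make_initializable_iterator()
next_batch = iterator.get_next()

sess.run(iterator.initializer)
for epoch in range(num_epochs):
    for _ in range(steps_per_epoch):  # epoch boundaries are tracked manually
        sess.run(train_op)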
|
I can also confirm this problem and I am also not passing a |
@ybsave |
Thanks for clarifying that @libulin! While it's true that the log messages are not strictly a bug, we appreciate that it might be confusing or annoying to have a flood of messages in your console from code that's operating correctly :). With the fix in 301a6c4 (and a related fix for the |
…ernel. A failing call to `Send()` indicates that the step has been aborted by a corresponding call to `Rendezvous::StartAbort()`. As a result, the error logged by `Send()` is not particularly informative, and creates a non-deterministic amount of extra log spam for each step that fails as `Send()` calls are being issued. The failure that causes the step to be aborted is logged separately by the kernel that failed, unless that kernel deliberately does not log on failure.

In particular, this change reduces log spam when using `Iterator.get_next()` in a multi-device setting. The `Iterator.get_next()` op deliberately does not log when an `OutOfRange` error (indicating the end of the dataset) is raised, because this is common and expected behavior, especially when using an initializable iterator that is reinitialized at the end of an epoch. Previously, when running in distributed mode or using a GPU, pending `Send()` calls may cause unwanted log messages to be printed.

Fixes #12414.

PiperOrigin-RevId: 175716290
I am using TF 1.4.0. The following works, though weird:

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # ERROR
import tensorflow as tf
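Since the flooded message is logged at warning level (the leading W in the log line), a value of '2' should already be enough to hide it while keeping real errors visible; as in the snippet above, the variable is set before TensorFlow is imported:

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # 1 hides INFO, 2 also hides WARNING, 3 also hides ERROR
import tensorflow as tf
|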
I am also getting this Warning |
@maxfiedler Can you reproduce this with TF 1.5.0? If so, please open a new issue with code to reproduce the warning and we'll take a look. |
@mrry I am still working with TF 1.4.0. I will let you know if the problem persists once we switch to TF 1.5.0, and will produce a code example if so. |
A few more things I notice (still on TF 1.4).
Plus a
and
Is there a way to signal the while loop that the iterator has reached the end of the dataset instead of throwing an OutOfRangeError? I am following the "load one batch (i.e. one get_next() op) and distribute it over the GPUs via tf.split" scheme. When the last batch is not cleanly splittable, I understandably get a tf.InvalidArgumentError. But before that I get a ton of the following warnings:
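One way to sidestep the unevenly sized final batch (and the failed tf.split) on more recent TF releases is to drop it when batching; a sketch under that assumption, with dataset and batch_size as placeholder names:

# Requires a TF version where Dataset.batch supports drop_remainder (added after 1.4).
dataset = dataset.batch(batch_size, drop_remainder=True)
# Every emitted batch now has exactly batch_size elements, so it can be split
# evenly across GPUs with tf.split; the leftover examples are simply discarded.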
|
I have the same Out of range: End of sequence warning with tf 1.4. |
@tanabics it was fixed for me with 1.5+ |
Hi - I have a related problem. I am able to run several epochs, but then it suddenly stops:
Start training process ...
The process is stopped and I get the error message: OutOfRangeError: End of sequence
I have asked for 70 epochs (and the data is prepared for that), but something is not working. I use TensorFlow 1.10. The model is not saved, so there must be a strange bug. It looks like some of the same issues as in this thread, but my model is never fully estimated and finished by TF. I get the impression that some people in this thread get an estimated model but additionally get an error message; I don't have a model returned by TF. Anyone? |
I got a similar problem when running several test cases, such as
The error messages are as follows:
|
I think that was an effect of 2c97ee3. I agree that this makes the output of
If you'd like to send a PR that disables the logging for |
@mrry Thanks for your quick reply! Will submit a PR for this. |
Hello |
System information
The problem
When using a tensorflow wheel built with cuda support, my app prints the following warning message at the end of a training epoch:
2017-08-19 14:01:18.214060: W tensorflow/core/framework/op_kernel.cc:1192] Out of range: End of sequence [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[?,?,132], [?], [?,?], [?]], output_types=[DT_FLOAT, DT_INT64, DT_INT64, DT_INT64], _device="/job:localhost/replica:0/task:0/cpu:0"](Iterator)]]
The code trains a seq2seq model, and I assume the message gets printed somewhere downstream of seq2seq.dynamic_decode. The message still gets printed even when the NN cells are not wrapped with a tf.contrib.rnn.DeviceWrapper whose device field indicates a GPU; everything works fine on non-CUDA builds.
All of this happens while the code is protected with the try/except statements:
Now the only cheeky thing is that I am using the binaries from Arch Linux repositories, but these are far from being dodgy.
python-tensorflow
python-tensorflow-cuda
The build script
This problem was in tensorflow 1.2 and persists in tensorflow 1.3.
Also tested on a laptop without a dedicated GPU but with the same OS and packages; it works fine.