conv2d gives NaN gradients with float16 input #7226

Closed
kshmelkov opened this issue Feb 2, 2017 · 3 comments

@kshmelkov

Environment info

Operating System: Ubuntu 16 LTS (the bug already reproduces on CPU)

If installed from binary pip package, provide:

  1. A link to the pip package you installed: recent nightly build
  2. The output from python -c "import tensorflow; print(tensorflow.__version__)".
>tf.__version__
'0.12.head'
>tf.__git_version__
'0.12.1-2263-g4cc0d1e-dirty'

If possible, provide a minimal reproducible example (We usually don't have time to read hundreds of lines of your code)

import tensorflow as tf
import numpy as np

slim = tf.contrib.slim

dtype = tf.float16
shape = (4, 16, 16, 3)

inpt = tf.placeholder(dtype, shape, name='input')
net = slim.conv2d(inpt, 16, [3, 3], scope='conv')
loss = tf.reduce_mean(net)
opt = tf.train.AdamOptimizer(1e-3)
train_op = slim.learning.create_train_op(loss, opt)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(2):
        val = np.random.randn(*shape)
        print(sess.run(train_op, feed_dict={inpt: val}))

So basically it breaks on the second step of SGD because the loss becomes NaN. If I change the dtype to float32, it works. It should have nothing to do with CUDA: I tested it on the CPU-only build as well as on a GPU with CUDA 8 and cuDNN 5.1.

What other attempted solutions have you tried?

I have no idea what to try here. For now I'm continuing with float32.

Logs or other output that would be helpful

(If logs are large, please upload as attachment or provide link).

-0.00072765
Traceback (most recent call last):
  File "/home/lear/kshmelko/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1022, in _do_call
    return fn(*args)
  File "/home/lear/kshmelko/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1004, in _run_fn
    status, run_metadata)
  File "/usr/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/home/lear/kshmelko/.local/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: LossTensor is inf or nan : Tensor had NaN values
         [[Node: train_op/CheckNumerics = CheckNumerics[T=DT_HALF, message="LossTensor is inf or nan", _device="/job:localhost/replica:0/task:0/cpu:0"](control_dependency)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test_bn.py", line 22, in <module>
    print(sess.run(train_op, feed_dict={inpt: val}))
  File "/home/lear/kshmelko/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/home/lear/kshmelko/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/home/lear/kshmelko/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/home/lear/kshmelko/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: LossTensor is inf or nan : Tensor had NaN values
         [[Node: train_op/CheckNumerics = CheckNumerics[T=DT_HALF, message="LossTensor is inf or nan", _device="/job:localhost/replica:0/task:0/cpu:0"](control_dependency)]]

Caused by op 'train_op/CheckNumerics', defined at:
  File "test_bn.py", line 16, in <module>
    train_op = slim.learning.create_train_op(loss, opt)
  File "/home/lear/kshmelko/.local/lib/python3.5/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 472, in create_train_op
    'LossTensor is inf or nan')
  File "/home/lear/kshmelko/.local/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 433, in check_numerics
    message=message, name=name)
  File "/home/lear/kshmelko/.local/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
    op_def=op_def)
  File "/home/lear/kshmelko/.local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2402, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/lear/kshmelko/.local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1264, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): LossTensor is inf or nan : Tensor had NaN values
         [[Node: train_op/CheckNumerics = CheckNumerics[T=DT_HALF, message="LossTensor is inf or nan", _device="/job:localhost/replica:0/task:0/cpu:0"](control_dependency)]]
@jmchen-g
Contributor

jmchen-g commented Feb 6, 2017

It is normal for float16 to not have enough range, especially at the beginning of training, so this is intended behavior rather than a bug...

If you want to ask around about how to train with float16, please go to Stack Overflow... Thanks.
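
As a rough illustration of those limits (a quick NumPy check, not a TensorFlow detail), float16 tops out around 65504 and underflows far earlier than float32:

import numpy as np

info = np.finfo(np.float16)
print(info.max)          # largest finite float16, about 65504
print(info.tiny)         # smallest normal float16, about 6.1e-5
print(np.float16(1e-8))  # 0.0: 1e-8 is below the smallest subnormal (~6e-8) and underflows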

@jmchen-g jmchen-g closed this as completed Feb 6, 2017
@kshmelkov
Author

Are you kidding me or what? How can it not have enough range when we start with just one convolution?

Fine, let's modify the example. Use a conv layer with all-zero weights

net = slim.conv2d(inpt, 16, [3, 3], scope='conv', weights_initializer=tf.zeros_initializer())

and an all-zero batch as well:

        val = np.zeros(shape)

It still fails. Do you imply that float16 does not have enough range to backpropagate through an all-zero convolution on an all-zero batch? It is clearly a bug somewhere in the native code. Please reopen this issue; this can't be intended behaviour.
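
Put together, the modified reproduction looks roughly like this (same TF 0.12-era tf.contrib.slim API as the original snippet; only the weight initializer and the input batch change):

import numpy as np
import tensorflow as tf

slim = tf.contrib.slim

dtype = tf.float16
shape = (4, 16, 16, 3)

inpt = tf.placeholder(dtype, shape, name='input')
# Zero-initialized weights: with the all-zero batch below, the loss is exactly zero
# and so is the gradient with respect to the conv weights.
net = slim.conv2d(inpt, 16, [3, 3], scope='conv',
                  weights_initializer=tf.zeros_initializer())
loss = tf.reduce_mean(net)
opt = tf.train.AdamOptimizer(1e-3)
train_op = slim.learning.create_train_op(loss, opt)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(2):
        val = np.zeros(shape)  # all-zero input batch
        print(sess.run(train_op, feed_dict={inpt: val}))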

@mjlm

mjlm commented Feb 28, 2017

For others with this issue, see here:
http://stackoverflow.com/questions/42064941/tensorflow-float16-support-is-broken

Setting the Adam epsilon to 1e-4 works for me.
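
A likely mechanism (hedged, not confirmed in this thread) is that Adam's default epsilon of 1e-8 underflows to 0.0 in float16 (the smallest subnormal is about 6e-8), so the per-variable update can become 0/0 = NaN whenever a gradient is zero. Applied to the reproduction above, the workaround looks roughly like:

# Default epsilon is 1e-8, which rounds to 0.0 in float16; 1e-4 stays representable.
opt = tf.train.AdamOptimizer(learning_rate=1e-3, epsilon=1e-4)
train_op = slim.learning.create_train_op(loss, opt)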
