Batchnorm errors in "successful" Windows builds #11865

aluo-x · 2017-07-29T07:05:18Z

Current system configuration:
Windows 10 64-bit, Intel i7-7700HQ (latest microcode), Nvidia 1050 4GB.
Driver: 384.94
Python used: Anaconda 4.4.0 Python 3.6.2 and 3.5.3
CUDA/cuDNN: 8.0.61/5.1 or 8.0.61.2/6.0
swigwin 3.0.12
Built 1.2.1 from source using VS 2015 Update 3, CMake 3.9.0 or 3.9.0 RC5, swigwin 3.0.12.
Code modifications: in builds with both cuDNN and AVX enabled, the code was modified according to this comment.
Issue description: Under certain conditions, "successful" builds of TensorFlow with GPU support have broken batchnorm functionality. An example error:

InvalidArgumentError (see above for traceback): indices[1] is out of range
[[Node: gradients/batch_normalization/moments/Mean_1_grad/DynamicStitch = DynamicStitch[N=2, T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/batch_normalization/moments/Mean_1_grad/range, gradients/batch_normalization/moments/Mean_1_grad/mod, gradients/batch_normalization/moments/Mean_1_grad/Shape, gradients/batch_normalization/moments/Mean_1_grad/Fill)]]

This error occurred in a variety of different builds, but most surprisingly in an unmodified Python 3.6 GPU build. Files and configuration can be found here.

aluo-x · 2017-07-29T07:12:20Z

A complete error print out:

C:\Users\ALuo\Anaconda3\python.exe C:/Users/ALuo/.PyCharm2017.2/config/scratches/scratch.py
Extracting /tmp/tensorflow/mnist/input_data\train-images-idx3-ubyte.gz
Extracting /tmp/tensorflow/mnist/input_data\train-labels-idx1-ubyte.gz
Extracting /tmp/tensorflow/mnist/input_data\t10k-images-idx3-ubyte.gz
Extracting /tmp/tensorflow/mnist/input_data\t10k-labels-idx1-ubyte.gz
2017-07-29 00:09:56.891124: W c:\optimae\tensorflow-1.2.1\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE instructions, but these are available on your machine and could speed up CPU computations.
2017-07-29 00:09:56.891322: W c:\optimae\tensorflow-1.2.1\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-29 00:09:56.891533: W c:\optimae\tensorflow-1.2.1\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-29 00:09:56.891720: W c:\optimae\tensorflow-1.2.1\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-29 00:09:56.891906: W c:\optimae\tensorflow-1.2.1\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-29 00:09:56.892095: W c:\optimae\tensorflow-1.2.1\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-07-29 00:09:56.892276: W c:\optimae\tensorflow-1.2.1\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-29 00:09:56.892467: W c:\optimae\tensorflow-1.2.1\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-07-29 00:09:57.558636: I c:\optimae\tensorflow-1.2.1\tensorflow\core\common_runtime\gpu\gpu_device.cc:940] Found device 0 with properties:
name: GeForce GTX 1050
major: 6 minor: 1 memoryClockRate (GHz) 1.493
pciBusID 0000:01:00.0
Total memory: 4.00GiB
Free memory: 3.30GiB
2017-07-29 00:09:57.558862: I c:\optimae\tensorflow-1.2.1\tensorflow\core\common_runtime\gpu\gpu_device.cc:961] DMA: 0
2017-07-29 00:09:57.559124: I c:\optimae\tensorflow-1.2.1\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0: Y
2017-07-29 00:09:57.559258: I c:\optimae\tensorflow-1.2.1\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0)
step 0, training accuracy 0.24
2017-07-29 00:09:59.180245: W c:\optimae\tensorflow-1.2.1\tensorflow\core\framework\op_kernel.cc:1158] Invalid argument: indices[1] is out of range
[[Node: gradients/batch_normalization/moments/shifted_mean_grad/DynamicStitch = DynamicStitch[N=2, T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/batch_normalization/moments/shifted_mean_grad/range, gradients/batch_normalization/moments/shifted_mean_grad/mod, gradients/batch_normalization/moments/shifted_mean_grad/Shape, gradients/batch_normalization/moments/shifted_mean_grad/Fill)]]
2017-07-29 00:09:59.180255: W c:\optimae\tensorflow-1.2.1\tensorflow\core\framework\op_kernel.cc:1158] Invalid argument: indices[1] is out of range
[[Node: gradients/batch_normalization/moments/shifted_mean_grad/DynamicStitch = DynamicStitch[N=2, T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/batch_normalization/moments/shifted_mean_grad/range, gradients/batch_normalization/moments/shifted_mean_grad/mod, gradients/batch_normalization/moments/shifted_mean_grad/Shape, gradients/batch_normalization/moments/shifted_mean_grad/Fill)]]
2017-07-29 00:09:59.180265: W c:\optimae\tensorflow-1.2.1\tensorflow\core\framework\op_kernel.cc:1158] Invalid argument: indices[1] is out of range
[[Node: gradients/batch_normalization/moments/shifted_mean_grad/DynamicStitch = DynamicStitch[N=2, T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/batch_normalization/moments/shifted_mean_grad/range, gradients/batch_normalization/moments/shifted_mean_grad/mod, gradients/batch_normalization/moments/shifted_mean_grad/Shape, gradients/batch_normalization/moments/shifted_mean_grad/Fill)]]
2017-07-29 00:09:59.180277: W c:\optimae\tensorflow-1.2.1\tensorflow\core\framework\op_kernel.cc:1158] Invalid argument: indices[1] is out of range
[[Node: gradients/batch_normalization/moments/shifted_mean_grad/DynamicStitch = DynamicStitch[N=2, T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/batch_normalization/moments/shifted_mean_grad/range, gradients/batch_normalization/moments/shifted_mean_grad/mod, gradients/batch_normalization/moments/shifted_mean_grad/Shape, gradients/batch_normalization/moments/shifted_mean_grad/Fill)]]
2017-07-29 00:09:59.180283: W c:\optimae\tensorflow-1.2.1\tensorflow\core\framework\op_kernel.cc:1158] Invalid argument: indices[1] is out of range
[[Node: gradients/batch_normalization/moments/shifted_mean_grad/DynamicStitch = DynamicStitch[N=2, T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/batch_normalization/moments/shifted_mean_grad/range, gradients/batch_normalization/moments/shifted_mean_grad/mod, gradients/batch_normalization/moments/shifted_mean_grad/Shape, gradients/batch_normalization/moments/shifted_mean_grad/Fill)]]
2017-07-29 00:10:00.245425: W c:\optimae\tensorflow-1.2.1\tensorflow\core\framework\op_kernel.cc:1158] Invalid argument: indices[1] is out of range
[[Node: gradients/batch_normalization/moments/shifted_mean_grad/DynamicStitch = DynamicStitch[N=2, T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/batch_normalization/moments/shifted_mean_grad/range, gradients/batch_normalization/moments/shifted_mean_grad/mod, gradients/batch_normalization/moments/shifted_mean_grad/Shape, gradients/batch_normalization/moments/shifted_mean_grad/Fill)]]
Traceback (most recent call last):
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1139, in _do_call
return fn(*args)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1121, in _run_fn
status, run_metadata)
File "C:\Users\ALuo\Anaconda3\lib\contextlib.py", line 88, in __exit__
next(self.gen)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[1] is out of range
[[Node: gradients/batch_normalization/moments/shifted_mean_grad/DynamicStitch = DynamicStitch[N=2, T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/batch_normalization/moments/shifted_mean_grad/range, gradients/batch_normalization/moments/shifted_mean_grad/mod, gradients/batch_normalization/moments/shifted_mean_grad/Shape, gradients/batch_normalization/moments/shifted_mean_grad/Fill)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:/Users/ALuo/.PyCharm2017.2/config/scratches/scratch.py", line 60, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "C:/Users/ALuo/.PyCharm2017.2/config/scratches/scratch.py", line 54, in main
train_step.run(feed_dict={x: batch[0], y: batch[1], train_phase: True})
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 1706, in run
_run_using_default_session(self, feed_dict, self.graph, session)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 3963, in _run_using_default_session
session.run(operation, feed_dict)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 789, in run
run_metadata_ptr)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 997, in _run
feed_dict_string, options, run_metadata)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1132, in _do_run
target_list, options, run_metadata)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[1] is out of range
[[Node: gradients/batch_normalization/moments/shifted_mean_grad/DynamicStitch = DynamicStitch[N=2, T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/batch_normalization/moments/shifted_mean_grad/range, gradients/batch_normalization/moments/shifted_mean_grad/mod, gradients/batch_normalization/moments/shifted_mean_grad/Shape, gradients/batch_normalization/moments/shifted_mean_grad/Fill)]]

Caused by op 'gradients/batch_normalization/moments/shifted_mean_grad/DynamicStitch', defined at:
File "C:/Users/ALuo/.PyCharm2017.2/config/scratches/scratch.py", line 60, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "C:/Users/ALuo/.PyCharm2017.2/config/scratches/scratch.py", line 42, in main
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\training\optimizer.py", line 315, in minimize
grad_loss=grad_loss)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\training\optimizer.py", line 386, in compute_gradients
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\ops\gradients_impl.py", line 540, in gradients
grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\ops\gradients_impl.py", line 346, in _MaybeCompile
return grad_fn() # Exit early
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\ops\gradients_impl.py", line 540, in <lambda>
grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\ops\math_grad.py", line 94, in _MeanGrad
sum_grad = _SumGrad(op, grad)[0]
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\ops\math_grad.py", line 56, in _SumGrad
output_shape_kept_dims = math_ops.reduced_shape(input_shape, op.inputs[1])
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\ops\math_ops.py", line 2272, in reduced_shape
array_ops.fill(axes_shape, 1)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\ops\gen_data_flow_ops.py", line 481, in dynamic_stitch
name=name)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 767, in apply_op
op_def=op_def)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 1269, in __init__
self._traceback = _extract_stack()

...which was originally created as op 'batch_normalization/moments/shifted_mean', defined at:
File "C:/Users/ALuo/.PyCharm2017.2/config/scratches/scratch.py", line 60, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
[elided 0 identical lines from previous traceback]
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "C:/Users/ALuo/.PyCharm2017.2/config/scratches/scratch.py", line 25, in main
h_norm4 = tf.layers.batch_normalization(h_conv4, training=train_phase)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\layers\normalization.py", line 441, in batch_normalization
return layer.apply(inputs, training=training)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\layers\base.py", line 492, in apply
return self.__call__(inputs, *args, **kwargs)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\layers\base.py", line 441, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\layers\normalization.py", line 287, in call
mean, variance = nn.moments(inputs, reduction_axes)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\ops\nn_impl.py", line 642, in moments
math_ops.subtract(y, shift), axes, keep_dims=True, name="shifted_mean")
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\ops\math_ops.py", line 1355, in reduce_mean
name=name)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 1290, in _mean
keep_dims=keep_dims, name=name)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 767, in apply_op
op_def=op_def)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 1269, in __init__
self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): indices[1] is out of range
[[Node: gradients/batch_normalization/moments/shifted_mean_grad/DynamicStitch = DynamicStitch[N=2, T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/batch_normalization/moments/shifted_mean_grad/range, gradients/batch_normalization/moments/shifted_mean_grad/mod, gradients/batch_normalization/moments/shifted_mean_grad/Shape, gradients/batch_normalization/moments/shifted_mean_grad/Fill)]]

Process finished with exit code 1
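
For readers following the traceback: the failing DynamicStitch node is created by math_ops.reduced_shape, which builds the keep_dims shape used by the gradient of a mean or sum. Below is a rough pure-Python sketch of that computation, based on the four inputs named in the error (range, mod, Shape, Fill). It is a simplification rather than TensorFlow's code: the function bodies, the 1-D restriction, and the example shape [50, 7, 7, 64] are assumptions, and whether the real kernel numbers the offending input the same way is also an assumption.

```python
def dynamic_stitch(indices, data):
    """Simplified 1-D stand-in for DynamicStitch (not the real kernel)."""
    # The output is sized by the largest index, so only a negative
    # (or otherwise corrupted) index can fall outside [0, size).
    size = max(i for idx in indices for i in idx) + 1
    merged = [0] * size
    for n, (idx, values) in enumerate(zip(indices, data)):
        for i, v in zip(idx, values):
            if not 0 <= i < size:
                raise ValueError("indices[%d] is out of range" % n)
            merged[i] = v
    return merged


def reduced_shape(input_shape, axes):
    """Shape of a reduction result as if keep_dims=True (sketch)."""
    rank = len(input_shape)
    wrapped_axes = [(a + rank) % rank for a in axes]  # the ".../mod" input
    indices = [list(range(rank)), wrapped_axes]       # ".../range", ".../mod"
    data = [list(input_shape), [1] * len(axes)]       # ".../Shape", ".../Fill"
    return dynamic_stitch(indices, data)


# Hypothetical NHWC activation shape, reduced over batch, height and width.
print(reduced_shape([50, 7, 7, 64], [0, 1, 2]))  # [1, 1, 1, 64]
```

In this sketch the modulo step keeps every axis inside [0, rank), so the second index list can only be rejected if it holds unexpected values at run time. That would be consistent with a miscompiled kernel, but the sketch does not prove it.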

A possible starting point is to compare my failing, unmodified Python 3.6.2 build with the official TensorFlow build.

Running diff -rqb on the two unpacked wheels gives:

Files tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64/tensorflow_gpu-1.2.1.data/purelib/tensorflow/contrib/rnn/python/ops/_gru_ops.dll and official_tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64/tensorflow_gpu-1.2.1.data/purelib/tensorflow/contrib/rnn/python/ops/_gru_ops.dll differ
Files tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64/tensorflow_gpu-1.2.1.data/purelib/tensorflow/contrib/rnn/python/ops/_lstm_ops.dll and official_tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64/tensorflow_gpu-1.2.1.data/purelib/tensorflow/contrib/rnn/python/ops/_lstm_ops.dll differ
Files tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64/tensorflow_gpu-1.2.1.data/purelib/tensorflow/contrib/seq2seq/python/ops/_beam_search_ops.dll and official_tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64/tensorflow_gpu-1.2.1.data/purelib/tensorflow/contrib/seq2seq/python/ops/_beam_search_ops.dll differ
Files tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64/tensorflow_gpu-1.2.1.data/purelib/tensorflow/python/_pywrap_tensorflow_internal.pyd and official_tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64/tensorflow_gpu-1.2.1.data/purelib/tensorflow/python/_pywrap_tensorflow_internal.pyd differ
Files tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64/tensorflow_gpu-1.2.1.data/purelib/tensorflow/python/pywrap_tensorflow_internal.lib and official_tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64/tensorflow_gpu-1.2.1.data/purelib/tensorflow/python/pywrap_tensorflow_internal.lib differ
Files tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64/tensorflow_gpu-1.2.1.data/purelib/tensorflow/python/pywrap_tensorflow_internal.py and official_tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64/tensorflow_gpu-1.2.1.data/purelib/tensorflow/python/pywrap_tensorflow_internal.py differ
Files tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64/tensorflow_gpu-1.2.1.dist-info/RECORD and official_tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64/tensorflow_gpu-1.2.1.dist-info/RECORD differ
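
For anyone without diff on Windows, a comparison like the one above can be approximated in Python. This is a minimal sketch, assuming both wheels have already been unpacked into directories. The function names are made up for illustration, and unlike diff -b it compares raw bytes, so whitespace-only changes also count as differences.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large DLLs are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root):
    """Map each file's path relative to root (with '/' separators) to its digest."""
    result = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(folder, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            result[rel] = file_digest(full)
    return result


def compare_trees(left_root, right_root):
    """Return (differing, only_in_left, only_in_right) as sorted path lists."""
    left, right = tree_digests(left_root), tree_digests(right_root)
    differing = sorted(p for p in left.keys() & right.keys() if left[p] != right[p])
    only_left = sorted(left.keys() - right.keys())
    only_right = sorted(right.keys() - left.keys())
    return differing, only_left, only_right
```

Calling compare_trees on the two unpacked wheel directories should list the same kind of files as the diff output above. Compiled binaries are expected to differ between any two builds, so the list only narrows down where to look.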

aluo-x · 2017-08-03T07:14:11Z

Here is the code that triggered the error in my builds but not in the official wheels. You can simply uncomment any of the batchnorm layers and connect them to the graph.

arcadien · 2017-11-03T07:58:32Z

With version r1.3 on Windows 10 with MSVC 2015, I noticed a warning that the macro TF_BATCHTOSPACE_BLOCK_DIMS_CASE is called with the wrong number of parameters in tensorflow\core\kernels\batchtospace_op.cc, line 197. Could it be related?

aluo-x · 2017-11-03T08:03:55Z

This was never resolved on my end, since the error did not occur deterministically and appeared at different layers. I will try to build 1.4 from source with cuDNN 7 on Windows 10 this week and report back.

aluo-x · 2017-11-24T23:58:50Z

Closing this issue for now since it is stale.

aluo-x · 2017-12-04T06:23:47Z

It seems this issue still occurs when compiling with AVX. See here.

mrry · 2017-12-12T16:43:16Z

I won't have time to look into this, so marking it as "Contributions Welcome."

fo40225 · 2018-02-01T09:29:36Z

This issue can be solved by updating the MSVC compiler (cl.exe) from 19.0.24210.0 to 19.0.24215.1.
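
To check which compiler a build machine would pick up before starting a long build, something like the following may help. It is a sketch only: the banner format ("... Compiler Version 19.00.24215.1 for x64") and the idea that running cl with no arguments prints that banner are assumptions about cl.exe, and the helper names are hypothetical. The threshold is the version quoted in the comment above.

```python
import re
import subprocess

# Version reported above as fixing the problem.
FIXED_VERSION = (19, 0, 24215, 1)


def parse_cl_version(banner):
    """Extract the dotted version after the word 'Version' as a tuple of ints."""
    match = re.search(r"Version\s+(\d+(?:\.\d+)+)", banner)
    if match is None:
        return None
    # int() makes "19.00" and "19.0" compare equal.
    return tuple(int(part) for part in match.group(1).split("."))


def is_at_least(version, minimum=FIXED_VERSION):
    """True if a parsed version tuple is at or above the minimum."""
    return version is not None and version >= minimum


def read_cl_banner():
    """Run cl with no arguments and return whatever it prints (assumed banner)."""
    completed = subprocess.run(
        ["cl"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge streams; the banner may be on either
        universal_newlines=True,
    )
    return completed.stdout
```

From a Visual Studio developer prompt, is_at_least(parse_cl_version(read_cl_banner())) returning False would suggest the older compiler is still first on the PATH.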

poxvoculi assigned mrry Jul 31, 2017

poxvoculi added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jul 31, 2017

aluo-x mentioned this issue Aug 3, 2017

Tensorflow 1.2.0 Win64, GPU and AVX enabled yaroslavvb/tensorflow-community-wheels#24

Open

aluo-x mentioned this issue Oct 15, 2017

C1002 error when building on Windows 10 64 bit, with vs 2017 #11096

Closed

aluo-x closed this as completed Nov 24, 2017

aluo-x reopened this Dec 4, 2017

mrry removed their assignment Dec 12, 2017

mrry added stat:contribution welcome Status - Contributions welcome and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Dec 12, 2017

wt-huang closed this as completed Sep 15, 2018
