Batchnorm errors in "successful" Windows builds #11865

Closed
aluo-x opened this issue Jul 29, 2017 · 8 comments
Labels
stat:contribution welcome Status - Contributions welcome

Comments

aluo-x commented Jul 29, 2017

  • Current system configuration:
  • Windows 10 64-bit, Intel i7-7700HQ (latest microcode), NVIDIA GeForce GTX 1050 4 GB.
  • Driver: 384.94
  • Python used: Anaconda 4.4.0 Python 3.6.2 and 3.5.3
  • CUDA/cuDNN: 8.0.61/5.1 or 8.0.61.2/6.0
  • swigwin 3.0.12
  • Built 1.2.1 from source using VS 2015 Update 3, CMake 3.9.0 or 3.9.0 RC5, swigwin 3.0.12.
  • Code modifications: in builds with both cuDNN and AVX enabled, the code was modified according to this comment.
  • Issue description: Under certain conditions, "successful" builds of TensorFlow with GPU support end up with broken batchnorm functionality. An example error:

InvalidArgumentError (see above for traceback): indices[1] is out of range
[[Node: gradients/batch_normalization/moments/Mean_1_grad/DynamicStitch = DynamicStitch[N=2, T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/batch_normalization/moments/Mean_1_grad/range, gradients/batch_normalization/moments/Mean_1_grad/mod, gradients/batch_normalization/moments/Mean_1_grad/Shape, gradients/batch_normalization/moments/Mean_1_grad/Fill)]]

This error was encountered in a variety of different builds, but it was most surprising when it occurred in an unmodified Python 3.6 GPU build. Files and configuration can be found here.

Author

aluo-x commented Jul 29, 2017

The complete error output:

C:\Users\ALuo\Anaconda3\python.exe C:/Users/ALuo/.PyCharm2017.2/config/scratches/scratch.py
Extracting /tmp/tensorflow/mnist/input_data\train-images-idx3-ubyte.gz
Extracting /tmp/tensorflow/mnist/input_data\train-labels-idx1-ubyte.gz
Extracting /tmp/tensorflow/mnist/input_data\t10k-images-idx3-ubyte.gz
Extracting /tmp/tensorflow/mnist/input_data\t10k-labels-idx1-ubyte.gz
2017-07-29 00:09:56.891124: W c:\optimae\tensorflow-1.2.1\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE instructions, but these are available on your machine and could speed up CPU computations.
2017-07-29 00:09:56.891322: W c:\optimae\tensorflow-1.2.1\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-29 00:09:56.891533: W c:\optimae\tensorflow-1.2.1\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-29 00:09:56.891720: W c:\optimae\tensorflow-1.2.1\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-29 00:09:56.891906: W c:\optimae\tensorflow-1.2.1\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-29 00:09:56.892095: W c:\optimae\tensorflow-1.2.1\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-07-29 00:09:56.892276: W c:\optimae\tensorflow-1.2.1\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-29 00:09:56.892467: W c:\optimae\tensorflow-1.2.1\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-07-29 00:09:57.558636: I c:\optimae\tensorflow-1.2.1\tensorflow\core\common_runtime\gpu\gpu_device.cc:940] Found device 0 with properties:
name: GeForce GTX 1050
major: 6 minor: 1 memoryClockRate (GHz) 1.493
pciBusID 0000:01:00.0
Total memory: 4.00GiB
Free memory: 3.30GiB
2017-07-29 00:09:57.558862: I c:\optimae\tensorflow-1.2.1\tensorflow\core\common_runtime\gpu\gpu_device.cc:961] DMA: 0
2017-07-29 00:09:57.559124: I c:\optimae\tensorflow-1.2.1\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0: Y
2017-07-29 00:09:57.559258: I c:\optimae\tensorflow-1.2.1\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0)
step 0, training accuracy 0.24
2017-07-29 00:09:59.180245: W c:\optimae\tensorflow-1.2.1\tensorflow\core\framework\op_kernel.cc:1158] Invalid argument: indices[1] is out of range
[[Node: gradients/batch_normalization/moments/shifted_mean_grad/DynamicStitch = DynamicStitch[N=2, T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/batch_normalization/moments/shifted_mean_grad/range, gradients/batch_normalization/moments/shifted_mean_grad/mod, gradients/batch_normalization/moments/shifted_mean_grad/Shape, gradients/batch_normalization/moments/shifted_mean_grad/Fill)]]
2017-07-29 00:09:59.180255: W c:\optimae\tensorflow-1.2.1\tensorflow\core\framework\op_kernel.cc:1158] Invalid argument: indices[1] is out of range
[[Node: gradients/batch_normalization/moments/shifted_mean_grad/DynamicStitch = DynamicStitch[N=2, T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/batch_normalization/moments/shifted_mean_grad/range, gradients/batch_normalization/moments/shifted_mean_grad/mod, gradients/batch_normalization/moments/shifted_mean_grad/Shape, gradients/batch_normalization/moments/shifted_mean_grad/Fill)]]
2017-07-29 00:09:59.180265: W c:\optimae\tensorflow-1.2.1\tensorflow\core\framework\op_kernel.cc:1158] Invalid argument: indices[1] is out of range
[[Node: gradients/batch_normalization/moments/shifted_mean_grad/DynamicStitch = DynamicStitch[N=2, T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/batch_normalization/moments/shifted_mean_grad/range, gradients/batch_normalization/moments/shifted_mean_grad/mod, gradients/batch_normalization/moments/shifted_mean_grad/Shape, gradients/batch_normalization/moments/shifted_mean_grad/Fill)]]
2017-07-29 00:09:59.180277: W c:\optimae\tensorflow-1.2.1\tensorflow\core\framework\op_kernel.cc:1158] Invalid argument: indices[1] is out of range
[[Node: gradients/batch_normalization/moments/shifted_mean_grad/DynamicStitch = DynamicStitch[N=2, T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/batch_normalization/moments/shifted_mean_grad/range, gradients/batch_normalization/moments/shifted_mean_grad/mod, gradients/batch_normalization/moments/shifted_mean_grad/Shape, gradients/batch_normalization/moments/shifted_mean_grad/Fill)]]
2017-07-29 00:09:59.180283: W c:\optimae\tensorflow-1.2.1\tensorflow\core\framework\op_kernel.cc:1158] Invalid argument: indices[1] is out of range
[[Node: gradients/batch_normalization/moments/shifted_mean_grad/DynamicStitch = DynamicStitch[N=2, T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/batch_normalization/moments/shifted_mean_grad/range, gradients/batch_normalization/moments/shifted_mean_grad/mod, gradients/batch_normalization/moments/shifted_mean_grad/Shape, gradients/batch_normalization/moments/shifted_mean_grad/Fill)]]
2017-07-29 00:10:00.245425: W c:\optimae\tensorflow-1.2.1\tensorflow\core\framework\op_kernel.cc:1158] Invalid argument: indices[1] is out of range
[[Node: gradients/batch_normalization/moments/shifted_mean_grad/DynamicStitch = DynamicStitch[N=2, T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/batch_normalization/moments/shifted_mean_grad/range, gradients/batch_normalization/moments/shifted_mean_grad/mod, gradients/batch_normalization/moments/shifted_mean_grad/Shape, gradients/batch_normalization/moments/shifted_mean_grad/Fill)]]
Traceback (most recent call last):
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1139, in _do_call
return fn(*args)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1121, in _run_fn
status, run_metadata)
File "C:\Users\ALuo\Anaconda3\lib\contextlib.py", line 88, in __exit__
next(self.gen)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[1] is out of range
[[Node: gradients/batch_normalization/moments/shifted_mean_grad/DynamicStitch = DynamicStitch[N=2, T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/batch_normalization/moments/shifted_mean_grad/range, gradients/batch_normalization/moments/shifted_mean_grad/mod, gradients/batch_normalization/moments/shifted_mean_grad/Shape, gradients/batch_normalization/moments/shifted_mean_grad/Fill)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:/Users/ALuo/.PyCharm2017.2/config/scratches/scratch.py", line 60, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "C:/Users/ALuo/.PyCharm2017.2/config/scratches/scratch.py", line 54, in main
train_step.run(feed_dict={x: batch[0], y_: batch[1], train_phase: True})
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 1706, in run
_run_using_default_session(self, feed_dict, self.graph, session)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 3963, in _run_using_default_session
session.run(operation, feed_dict)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 789, in run
run_metadata_ptr)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 997, in _run
feed_dict_string, options, run_metadata)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1132, in _do_run
target_list, options, run_metadata)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[1] is out of range
[[Node: gradients/batch_normalization/moments/shifted_mean_grad/DynamicStitch = DynamicStitch[N=2, T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/batch_normalization/moments/shifted_mean_grad/range, gradients/batch_normalization/moments/shifted_mean_grad/mod, gradients/batch_normalization/moments/shifted_mean_grad/Shape, gradients/batch_normalization/moments/shifted_mean_grad/Fill)]]

Caused by op 'gradients/batch_normalization/moments/shifted_mean_grad/DynamicStitch', defined at:
File "C:/Users/ALuo/.PyCharm2017.2/config/scratches/scratch.py", line 60, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "C:/Users/ALuo/.PyCharm2017.2/config/scratches/scratch.py", line 42, in main
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\training\optimizer.py", line 315, in minimize
grad_loss=grad_loss)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\training\optimizer.py", line 386, in compute_gradients
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\ops\gradients_impl.py", line 540, in gradients
grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\ops\gradients_impl.py", line 346, in _MaybeCompile
return grad_fn() # Exit early
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\ops\gradients_impl.py", line 540, in <lambda>
grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\ops\math_grad.py", line 94, in _MeanGrad
sum_grad = _SumGrad(op, grad)[0]
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\ops\math_grad.py", line 56, in _SumGrad
output_shape_kept_dims = math_ops.reduced_shape(input_shape, op.inputs[1])
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\ops\math_ops.py", line 2272, in reduced_shape
array_ops.fill(axes_shape, 1)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\ops\gen_data_flow_ops.py", line 481, in dynamic_stitch
name=name)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 767, in apply_op
op_def=op_def)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 1269, in __init__
self._traceback = _extract_stack()

...which was originally created as op 'batch_normalization/moments/shifted_mean', defined at:
File "C:/Users/ALuo/.PyCharm2017.2/config/scratches/scratch.py", line 60, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
[elided 0 identical lines from previous traceback]
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "C:/Users/ALuo/.PyCharm2017.2/config/scratches/scratch.py", line 25, in main
h_norm4 = tf.layers.batch_normalization(h_conv4, training=train_phase)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\layers\normalization.py", line 441, in batch_normalization
return layer.apply(inputs, training=training)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\layers\base.py", line 492, in apply
return self.__call__(inputs, *args, **kwargs)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\layers\base.py", line 441, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\layers\normalization.py", line 287, in call
mean, variance = nn.moments(inputs, reduction_axes)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\ops\nn_impl.py", line 642, in moments
math_ops.subtract(y, shift), axes, keep_dims=True, name="shifted_mean")
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\ops\math_ops.py", line 1355, in reduce_mean
name=name)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 1290, in _mean
keep_dims=keep_dims, name=name)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 767, in apply_op
op_def=op_def)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "C:\Users\ALuo\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 1269, in __init__
self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): indices[1] is out of range
[[Node: gradients/batch_normalization/moments/shifted_mean_grad/DynamicStitch = DynamicStitch[N=2, T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/batch_normalization/moments/shifted_mean_grad/range, gradients/batch_normalization/moments/shifted_mean_grad/mod, gradients/batch_normalization/moments/shifted_mean_grad/Shape, gradients/batch_normalization/moments/shifted_mean_grad/Fill)]]

Process finished with exit code 1
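
For readers following the traceback above: the failing DynamicStitch node is created inside math_ops.reduced_shape, which _SumGrad calls to get the shape of the reduced tensor with each reduced dimension kept as size 1. The sketch below is a pure-Python approximation of that computation and is not TensorFlow's code. The helper bodies, the example shape [50, 7, 7, 64], and the hand-injected bad index are assumptions made for illustration, and the error string only imitates the wording of the logged message.

```python
def dynamic_stitch(indices, data):
    """Simplified 1-D stand-in for DynamicStitch: merged[indices[k][i]] = data[k][i]."""
    flat = [i for part in indices for i in part]
    size = max(flat) + 1 if flat else 0
    merged = [0] * size
    for part, values in zip(indices, data):
        for pos, (index, value) in enumerate(zip(part, values)):
            # Simplified bounds check (assumption): reject anything outside [0, size).
            if not 0 <= index < size:
                raise ValueError("indices[%d] is out of range" % pos)
            merged[index] = value
    return merged


def reduced_shape(input_shape, axes):
    """Shape of a reduction over `axes` with the reduced dimensions kept as 1."""
    rank = len(input_shape)
    axes = [(a + rank) % rank for a in axes]  # map negative axes into [0, rank)
    return dynamic_stitch([list(range(rank)), axes],
                          [list(input_shape), [1] * len(axes)])


# Hypothetical NHWC activation shape, reduced over every axis except channels.
print(reduced_shape([50, 7, 7, 64], [0, 1, 2]))  # prints [1, 1, 1, 64]

try:
    # A negative index injected by hand to trigger the bounds check.
    dynamic_stitch([[0, 1, 2, 3], [0, -3, 2]], [[50, 7, 7, 64], [1, 1, 1]])
except ValueError as err:
    print(err)  # prints: indices[1] is out of range
```

In this sketch, well-formed axes never trip the check, so the error needs a bad value to reach the stitch. The sketch cannot show where such a value came from in the failing binaries.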

A starting point might be to compare my unmodified, failing Python 3.6.2 build with the official TensorFlow build.

Running diff -rqb on the two unpacked wheels:

Files tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64/tensorflow_gpu-1.2.1.data/purelib/tensorflow/contrib/rnn/python/ops/_gru_ops.dll and official_tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64/tensorflow_gpu-1.2.1.data/purelib/tensorflow/contrib/rnn/python/ops/_gru_ops.dll differ
Files tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64/tensorflow_gpu-1.2.1.data/purelib/tensorflow/contrib/rnn/python/ops/_lstm_ops.dll and official_tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64/tensorflow_gpu-1.2.1.data/purelib/tensorflow/contrib/rnn/python/ops/_lstm_ops.dll differ
Files tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64/tensorflow_gpu-1.2.1.data/purelib/tensorflow/contrib/seq2seq/python/ops/_beam_search_ops.dll and official_tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64/tensorflow_gpu-1.2.1.data/purelib/tensorflow/contrib/seq2seq/python/ops/_beam_search_ops.dll differ
Files tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64/tensorflow_gpu-1.2.1.data/purelib/tensorflow/python/_pywrap_tensorflow_internal.pyd and official_tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64/tensorflow_gpu-1.2.1.data/purelib/tensorflow/python/_pywrap_tensorflow_internal.pyd differ
Files tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64/tensorflow_gpu-1.2.1.data/purelib/tensorflow/python/pywrap_tensorflow_internal.lib and official_tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64/tensorflow_gpu-1.2.1.data/purelib/tensorflow/python/pywrap_tensorflow_internal.lib differ
Files tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64/tensorflow_gpu-1.2.1.data/purelib/tensorflow/python/pywrap_tensorflow_internal.py and official_tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64/tensorflow_gpu-1.2.1.data/purelib/tensorflow/python/pywrap_tensorflow_internal.py differ
Files tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64/tensorflow_gpu-1.2.1.dist-info/RECORD and official_tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64/tensorflow_gpu-1.2.1.dist-info/RECORD differ
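
For anyone without diff on Windows, a similar comparison can be sketched with Python's standard library. This is a minimal sketch under assumptions: the function name differing_files is made up here, it compares bytes exactly (so it does not mimic the whitespace handling of -b), and it only reports files present in both trees.

```python
import filecmp
import os


def differing_files(left, right):
    """Return sorted relative paths of files present in both trees whose bytes differ."""
    differing = []
    for root, _dirs, files in os.walk(left):
        rel_root = os.path.relpath(root, left)
        for name in files:
            rel = os.path.normpath(os.path.join(rel_root, name))
            other = os.path.join(right, rel)
            # shallow=False forces a byte-by-byte comparison instead of trusting os.stat().
            if os.path.isfile(other) and not filecmp.cmp(
                    os.path.join(root, name), other, shallow=False):
                differing.append(rel)
    return sorted(differing)
```

Calling differing_files("tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64", "official_tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64") on the two unpacked directories would list the mismatching files. Compiled binaries from separate builds are expected to differ, so the list narrows down where to look rather than pointing at a cause.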

Author

aluo-x commented Aug 3, 2017

Here is the code that triggered the error in my builds but not in the official wheels. You can simply uncomment any of the batchnorm layers and connect them to the graph.


arcadien commented Nov 3, 2017

With version r1.3 under Windows 10 and MSVC 2015, I noticed a warning about the macro TF_BATCHTOSPACE_BLOCK_DIMS_CASE being called with the wrong number of parameters in tensorflow\core\kernels\batchtospace_op.cc, line 197. Could it be related?

Author

aluo-x commented Nov 3, 2017

This was never resolved on my end, since the error was not deterministic and the failure occurred at different layers. I will try to build 1.4 from source with cuDNN 7 on Windows 10 this week and report back.
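
Since the failure reportedly moved between layers, one way to characterise it is to capture the output of several runs and tally which graph nodes are named in the errors. The sketch below assumes the logs are available as text and that node names appear in the "[[Node: name = ...]]" form seen in the TensorFlow 1.2.1 output earlier in this thread. The function name failing_nodes is hypothetical.

```python
import re
from collections import Counter

# Matches the node name in lines such as "[[Node: some/node/name = DynamicStitch[...]".
NODE_RE = re.compile(r"\[\[Node: (\S+) = ")


def failing_nodes(log_text):
    """Count how often each graph node is named in '[[Node: ... = ' log lines."""
    return Counter(NODE_RE.findall(log_text))
```

Feeding it the concatenated logs of repeated runs and calling most_common() on the result shows whether the same few nodes keep failing or the failures are spread across the graph.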

Author

aluo-x commented Nov 24, 2017

Closing this issue for now since it is stale.

aluo-x closed this as completed Nov 24, 2017
Author

aluo-x commented Dec 4, 2017

It seems this issue still occurs when compiling with AVX. See here.

aluo-x reopened this Dec 4, 2017
mrry removed their assignment Dec 12, 2017
mrry added the stat:contribution welcome label and removed the stat:awaiting tensorflower label Dec 12, 2017
Contributor

mrry commented Dec 12, 2017

I won't have time to look into this, so marking it as "Contributions Welcome."

Contributor

fo40225 commented Feb 1, 2018

This issue can be solved by updating the MSVC compiler (cl.exe) from 19.0.24210.0 to 19.0.24215.1.
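
To check a build machine against the version reported above, the banner that cl.exe prints can be parsed and compared. This is a minimal sketch: the banner wording ("... Compiler Version 19.00.24215.1 for x64") is an assumption about what the compiler prints, and the names cl_version and has_reported_fix are made up for illustration.

```python
import re

# cl.exe version reported in this thread as resolving the issue.
FIXED_VERSION = (19, 0, 24215, 1)


def cl_version(banner):
    """Extract the dotted version from a cl.exe banner as a tuple of ints, or None."""
    match = re.search(r"Version (\d+(?:\.\d+)+)", banner)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def has_reported_fix(banner):
    """True if the banner's version is at least the one reported as fixed."""
    version = cl_version(banner)
    return version is not None and version >= FIXED_VERSION
```

Comparing integer tuples rather than strings means "19.00" and "19.0" are treated as the same component.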
