
tf_cnn_benchmarks.py does not support --data_dir with my imagenet1k tfrecords #197

Closed
rreece opened this issue May 30, 2018 · 25 comments
@rreece

rreece commented May 30, 2018

I'm using the HEAD of both tensorflow and benchmarks.
I can run the tf_cnn_benchmarks.py with synthetic data like this:

python3 tf_cnn_benchmarks.py --num_batches=100 --display_every=1 --device=cpu --data_format=NHWC --model=trivial --batch_size=64

But when I try to specify my own local data_dir of tfrecords for imagenet1k, it hangs sometime after printing "Running warm up":

python3 tf_cnn_benchmarks.py --num_batches=100 --display_every=1 --device=cpu --data_format=NHWC --model=trivial --batch_size=64 --data_dir=/n0/ryan/imagenet1k_tfrecord
TensorFlow:  1.8  
Model:       trivial
Dataset:     imagenet
Mode:        training
SingleSess:  False
Batch size:  64 global
             64.0 per device
Num batches: 100
Num epochs:  0.00 
Devices:     ['/cpu:0']
Data format: NHWC 
Layout optimizer: False
Optimizer:   sgd  
Variables:   parameter_server
==========
Generating model
W0530 13:48:44.750849 140466104280896 tf_logging.py:125] From /home/ryan/sandbox/rreece/onboarding-cerebras/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1611: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-05-30 13:48:44.798403: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX512F
I0530 13:48:44.929922 140466104280896 tf_logging.py:115] Running local_init_op.
I0530 13:48:50.095620 140466104280896 tf_logging.py:115] Done running local_init_op.
Running warm up

and then it hangs.

Any ideas how I can debug using my own local dataset?

I noticed these seemingly related closed issues: #150 and #176, but they do not seem to be hanging at the same place tf_cnn_benchmarks.py does for me.

Thanks!

@abidmalikwaterloo

@rreece I used --data_name=imagenet --data_dir=/home/IMAGENETDATA/tfrecrd/train/ and it works.

@rreece
Author

rreece commented Jun 5, 2018

Thanks for the follow-up!

First, I should say that I'm running on a custom Dell C4140 server that has no gpu or other accelerator. It has 2 x 16 cores (Xeon Gold 6130).

When I run like you say above, it hangs on "Running warm up" just like I mentioned before.

python3 tf_cnn_benchmarks.py --data_name=imagenet --data_dir=/n0/ryan/imagenet1k_tfrecord
TensorFlow:  1.8  
Model:       trivial
Dataset:     imagenet
Mode:        training
SingleSess:  False
Batch size:  32 global
             32.0 per device
Num batches: 100
Num epochs:  0.00 
Devices:     ['/gpu:0']
Data format: NCHW 
Layout optimizer: False
Optimizer:   sgd  
Variables:   parameter_server
==========
Generating model
W0604 10:01:16.727709 140183955326784 tf_logging.py:125] From /home/ryan/sandbox/rreece/onboarding-cerebras/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1611: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-06-04 10:01:16.776187: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX512F
I0604 10:01:16.895823 140183955326784 tf_logging.py:115] Running local_init_op.
I0604 10:01:22.039721 140183955326784 tf_logging.py:115] Done running local_init_op.
Running warm up
...  (hangs)

I added some options to select device=cpu instead of gpu, and data_format=NHWC because I am pretty sure my local tfrecords are written that way.

python3 tf_cnn_benchmarks.py --num_batches=100 --display_every=1 --device=cpu --data_format=NHWC --model=trivial --batch_size=64 --data_name=imagenet --data_dir=/n0/ryan/imagenet1k_tfrecord
TensorFlow:  1.8  
Model:       trivial
Dataset:     imagenet
Mode:        training
SingleSess:  False
Batch size:  64 global
             64.0 per device
Num batches: 100
Num epochs:  0.00 
Devices:     ['/cpu:0']
Data format: NHWC 
Layout optimizer: False
Optimizer:   sgd  
Variables:   parameter_server
==========
Generating model
W0604 09:58:23.847348 140426272913216 tf_logging.py:125] From /home/ryan/sandbox/rreece/onboarding-cerebras/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1611: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-06-04 09:58:23.895069: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX512F
I0604 09:58:24.015544 140426272913216 tf_logging.py:115] Running local_init_op.
I0604 09:58:29.259888 140426272913216 tf_logging.py:115] Done running local_init_op.
Running warm up
...  (hangs)

@tfboyd
Member

tfboyd commented Jun 5, 2018

The warm up is 10 steps, or 640 images. I have no idea how fast trivial is on CPU, as that is not a test that I run. The benchmark code is not set up for CPU by default, and it limits the number of CPU threads. I link to the guide below; try the following command. 0 means let the system pick, which means the number of logical threads, in your case maybe 64:

python tf_cnn_benchmarks.py --device=cpu \
--nodistortions --model=resnet50 --data_format=NHWC --batch_size=1 \
--num_inter_threads=0 --num_intra_threads=0 \
--data_dir=<path to ImageNet TFRecords>

You could also set the warmup (search the args for something like num_warmup_batches) to be shorter; I do that with CPU as I hate waiting. Just don't use zero, as I doubt we catch that (or we might). Set it to 1 or something.

If you have not already, see the perf guide on CPU for some expected training times; for ResNet50 on CPU, for example, about 6.2 images/second with, I think, 36 physical cores.

If you use MKL you may want to try different settings. See the perf guide.

@rreece
Author

rreece commented Jun 7, 2018

Hi,

Thanks a lot for the feedback.

In all variations I've tried, it is still hanging in the same place, at

results = sess.run(fetches, options=run_options, run_metadata=run_metadata)

https://github.com/tensorflow/benchmarks/blob/master/scripts/tf_cnn_benchmarks/benchmark_cnn.py#L663

I have some confidence in my dataset because I am able to run the training in tensorflow/models/official/resnet with the same tfrecords.

https://github.com/tensorflow/models/tree/master/official/resnet

Are you saying the warmup may take 10 x 640 / 6.2 = 1032 secs = 17.2 minutes? For me, htop shows basically no cpu utilization.

@rreece
Author

rreece commented Jun 7, 2018

I've let it run for 45 minutes now, still hung, so I do not think that it is a matter of not waiting for the warm up to finish.

@reedwm
Member

reedwm commented Jun 7, 2018

It's difficult to debug this because we don't have the images. Can you try running with --forward_only, --device=cpu --data_format=NCHW, and --use_datasets=false? (Run each of the three separately). Also, you can try running an older commit, like cbaedcf.

If those don't work, I'm not sure how to debug this.

@GRGargallo

I have the same error. But using commit cbaedcf, which @reedwm told us about before, works perfectly. It is possible that the error is caused by the use of CPU-only nodes.

@rreece
Author

rreece commented Jun 12, 2018

Sorry for the delay. I confirm, along with @guiramirez: cbaedcf works with the private imagenet1k tfrecords I have been using!

@wei-v-wang

wei-v-wang commented Jun 12, 2018

@rreece Sorry for the delayed response! I just noticed this issue.
I am pretty confident this issue (prefetching enabled for CPU by default, when it should be disabled) is now fixed by @reedwm and merged in PR #181.
So @rreece, you can use the latest commit now if you are willing to.

@rreece
Author

rreece commented Jun 14, 2018

I switched back to the HEAD of master today. But the following does not work (with a different error):

python3 tf_cnn_benchmarks.py --device=cpu --nodistortions --model=resnet50 --data_format=NHWC --data_dir=/n0/ryan/imagenet1k_tfrecord
Traceback (most recent call last):
  File "tf_cnn_benchmarks.py", line 27, in <module>
    import benchmark_cnn
  File "/home/ryan/sandbox/rreece/onboarding-cerebras/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 51, in <module>
    import variable_mgr
  File "/home/ryan/sandbox/rreece/onboarding-cerebras/benchmarks/scripts/tf_cnn_benchmarks/variable_mgr.py", line 25, in <module>
    import allreduce
  File "/home/ryan/sandbox/rreece/onboarding-cerebras/benchmarks/scripts/tf_cnn_benchmarks/allreduce.py", line 28, in <module>
    from tensorflow.python.ops import collective_ops
ImportError: cannot import name 'collective_ops'

Again, switching back to commit cbaedcf worked (adding some options to shorten the warmup):

git checkout cbaedcf880f365d4a98c5f6b5bd31d3dffcb72cd
python3 tf_cnn_benchmarks.py --device=cpu --nodistortions --model=resnet50 --data_format=NHWC --data_dir=/n0/ryan/imagenet1k_tfrecord --batch_size=1 --num_warmup_batches=4 --forward_only=True

TensorFlow:  1.8
Model:       resnet50
Dataset:     imagenet
Mode:        forward-only
SingleSess:  False
Batch size:  1 global
             1.0 per device
Devices:     ['/cpu:0']
Data format: NHWC
Layout optimizer: False
Optimizer:   sgd
Variables:   parameter_server
==========
Generating model
WARNING:tensorflow:From /home/ryan/sandbox/rreece/onboarding-cerebras/benchmarks/scripts/tf_cnn_benchmarks/convnet_builder.py:372: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From /home/ryan/sandbox/rreece/onboarding-cerebras/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1120: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-06-14 13:05:03.744268: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX512F
Running warm up
Done warm up
Step    Img/sec loss    top_1_accuracy  top_5_accuracy
1       images/sec: 2.7 +/- 0.0 (jitter = 0.0)  0.000   0.000   0.000
10      images/sec: 2.8 +/- 0.0 (jitter = 0.1)  0.000   0.000   0.000
20      images/sec: 2.7 +/- 0.0 (jitter = 0.1)  0.000   0.000   0.000
30      images/sec: 2.7 +/- 0.0 (jitter = 0.1)  0.000   0.000   0.000
40      images/sec: 2.7 +/- 0.0 (jitter = 0.1)  0.000   0.000   0.000
50      images/sec: 2.7 +/- 0.0 (jitter = 0.1)  0.000   0.000   0.000
60      images/sec: 2.7 +/- 0.0 (jitter = 0.1)  0.000   0.000   0.000
70      images/sec: 2.7 +/- 0.0 (jitter = 0.1)  0.000   0.000   0.000
80      images/sec: 2.7 +/- 0.0 (jitter = 0.1)  0.000   0.000   0.000
90      images/sec: 2.7 +/- 0.0 (jitter = 0.1)  0.000   0.000   0.000
100     images/sec: 2.7 +/- 0.0 (jitter = 0.1)  0.000   0.000   0.000
----------------------------------------------------------------
total images/sec: 2.74
----------------------------------------------------------------

@GRGargallo

To use the newer versions of benchmarks you need a newer version of tensorflow (#202).
I use tensorflow 1.9 and it works perfectly.

@rreece
Author

rreece commented Jul 13, 2018

Sorry it took me so long to get back to this.

I tried the head of benchmarks today with tensorflow 1.9.0, and it worked! Thanks for the feedback. Closing this issue.

@rreece rreece closed this as completed Jul 13, 2018
@MichaelGou1105

However, I am failing with tensorflow 1.10. How can I solve it?
cmd:

python  tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 \
--batch_size=64 --model=resnet50 --variable_update=replicated \
--optimizer=sgd --variable_update=replicated --gradient_repacking=4 --weight_decay=1e-4  \
--num_warmup_batches=4  --nodistortions  --data_format=NCHW \
--print_training_accuracy  \
--train_dir=/export/home/imagenet/benchmarks/scripts/tf_cnn_benchmarks/single_ckpt \
--data_name=imagenet --data_dir=/export/Data/imagenet/tfrecords

result:

Traceback (most recent call last):
  File "tf_cnn_benchmarks.py", line 60, in <module>
    app.run(main)  # Raises error on invalid flags, unlike tf.app.run()
  File "/export/home/.virtualenv/tf/lib/python2.7/site-packages/absl/app.py", line 278, in run
    _run_main(main, args)
  File "/export/home/.virtualenv/tf/lib/python2.7/site-packages/absl/app.py", line 239, in _run_main
    sys.exit(main(argv))
  File "tf_cnn_benchmarks.py", line 56, in main
    bench.run()
  File "/export/home/imagenet/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1523, in run
    return self._benchmark_train()
  File "/export/home/imagenet/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1646, in _benchmark_train
    return self._benchmark_graph(result_to_benchmark)
  File "/export/home/imagenet/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1979, in _benchmark_graph
    'num_steps': num_steps,
UnboundLocalError: local variable 'num_steps' referenced before assignment

@reedwm
Member

reedwm commented Aug 29, 2018

@AnberLu, this is similar to #150. I cannot reproduce. What commit of tf_cnn_benchmarks are you using? Are you using Python 2 or 3? And, are you using the full imagenet dataset, or custom data?

@MichaelGou1105

MichaelGou1105 commented Sep 11, 2018

@reedwm I use Python 2, tensorflow 1.10, and the full imagenet dataset.

@grantgumina

I'm getting the same num_steps error with Python 3.6.3, tf-nightly-gpu, and the latest benchmark code.

@addisokw

I'm having the same issue with Python 3, TF 1.10 (from the AMD ROCm docker image), and the full, proper ImageNet dataset. I can train a resnet50 network fine with my tf_record, but the benchmark script fails with the same num_steps error referenced above.

@reedwm
Member

reedwm commented Sep 28, 2018

This is a very strange error, because num_steps is assigned in every code path before it is used. I don't see how this error could occur, so I must be missing something. Can someone with the num_steps error provide the following?

  1. The command line they used to run tf_cnn_benchmarks
  2. The version of TensorFlow being used, and how it was obtained (pip, install from source, etc.)
  3. The tf_cnn_benchmarks commit being used.
  4. The full output of tf_cnn_benchmarks, including the stacktrace.

@AnberLu gave almost everything, but didn't give me the tf_cnn_benchmarks commit.

@addisokw

This is a very strange error, because num_steps is assigned in every code path before it is used. I don't see how this error could occur, so I must be missing something. Can someone with the num_steps error provide the following?

  1. The command line they used to run tf_cnn_benchmarks
     python3 ./tf_cnn_benchmarks.py --model=resnet50 --data_dir=/data/ImageNet --data_name=imagenet
  2. The version of TensorFlow being used, and how it was obtained (pip, install from source, etc.)
     tensorflow-1.10.0-cp35-cp35m-linux_x86_64 from the AMD ROCm
  3. The tf_cnn_benchmarks commit being used.
     cnn_tf_v1.10_compatible
  4. The full output of tf_cnn_benchmarks, including the stacktrace.
     TensorFlow:  1.10
     Model:       resnet50
     Dataset:     imagenet
     Mode:        training
     SingleSess:  False
     Batch size:  64 global
                  64.0 per device
     Num batches: 100
     Num epochs:  0.00
     Devices:     ['/gpu:0']
     Data format: NCHW
     Layout optimizer: False
     Optimizer:   sgd
     Variables:   parameter_server
     ==========
     Generating model
     W0928 22:14:30.133174 139972170786560 tf_logging.py:125] From /root/workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1778: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
     Instructions for updating:
     Please switch to tf.train.MonitoredTrainingSession
     2018-09-28 22:14:30.565075: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX512F
     2018-09-28 22:14:30.571252: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1519] Found device 0 with properties:
     name: Device 6863
     AMDGPU ISA: gfx900
     memoryClockRate (GHz) 1.6
     pciBusID 0000:19:00.0
     Total memory: 15.98GiB
     Free memory: 15.73GiB
     2018-09-28 22:14:30.571274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1630] Adding visible gpu devices: 0
     2018-09-28 22:14:30.571289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1039] Device interconnect StreamExecutor with strength 1 edge matrix:
     2018-09-28 22:14:30.571295: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045]      0
     2018-09-28 22:14:30.571299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1058] 0:   N
     2018-09-28 22:14:30.571330: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15306 MB memory) -> physical GPU (device: 0, name: Device 6863, pci bus id: 0000:19:00.0)
     I0928 22:14:35.356503 139972170786560 tf_logging.py:115] Running local_init_op.
     I0928 22:14:40.353496 139972170786560 tf_logging.py:115] Done running local_init_op.
     Running warm up
     Traceback (most recent call last):
       File "./tf_cnn_benchmarks.py", line 60, in <module>
         app.run(main)  # Raises error on invalid flags, unlike tf.app.run()
       File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
         _run_main(main, args)
       File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
         sys.exit(main(argv))
       File "./tf_cnn_benchmarks.py", line 56, in main
         bench.run()
       File "/root/workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1480, in run
         return self._benchmark_train()
       File "/root/workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1603, in _benchmark_train
         return self._benchmark_graph(result_to_benchmark)
       File "/root/workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1931, in _benchmark_graph
         'num_steps': num_steps,
     UnboundLocalError: local variable 'num_steps' referenced before assignment


@reedwm
Member

reedwm commented Sep 28, 2018

@addisokw, thank you for the info. @protoget found a possible way for this to happen: if the Supervisor managed session catches an exception but does not reraise it, it's possible for num_steps to never be assigned.

I will look more into this.
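The mechanism @protoget describes can be reproduced in a few lines of plain Python. This is a hypothetical sketch (no TensorFlow; EOFError stands in for tf.errors.OutOfRangeError, and managed_session_like for Supervisor.managed_session), not the benchmark's actual code:

```python
from contextlib import contextmanager

@contextmanager
def managed_session_like():
    # Stand-in for Supervisor.managed_session, which catches the
    # OutOfRangeError and does not reraise it.
    try:
        yield
    except EOFError:  # stand-in for tf.errors.OutOfRangeError
        pass          # silently swallowed

def benchmark_graph_like():
    with managed_session_like():
        raise EOFError("End of sequence")  # input pipeline exhausted
        num_steps = 10                     # never reached
    # Control resumes here because the exception was swallowed:
    return {'num_steps': num_steps}        # raises UnboundLocalError

try:
    benchmark_graph_like()
except UnboundLocalError:
    print("num_steps was never assigned")
```

So an exhausted (or unreadable) dataset surfaces not as an input error but as the cryptic num_steps UnboundLocalError.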

@reedwm reedwm reopened this Sep 28, 2018
@reedwm reedwm self-assigned this Sep 28, 2018
@reedwm
Member

reedwm commented Sep 28, 2018

@addisokw, can you try this fork/branch: https://github.com/reedwm/benchmarks/tree/num_steps_issue? I forked the repo and added some extra debugging output, which will help me determine the issue. After cloning, make sure you run git checkout num_steps_issue.

@addisokw

When trying the num_steps_issue branch from above, I get an immediate error

root@deeplearn:/root/workspace/benchmarks/scripts/tf_cnn_benchmarks# python3 ./tf_cnn_benchmarks.py --model=resnet50 --data_dir=/data/ImageNet --data_name=imagenet --num_gpus=1
Traceback (most recent call last):
  File "./tf_cnn_benchmarks.py", line 27, in <module>
    import benchmark_cnn
  File "/root/workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 40, in <module>
    import data_utils
  File "/root/workspace/benchmarks/scripts/tf_cnn_benchmarks/data_utils.py", line 25, in <module>
    from tensorflow.python.data.ops import multi_device_iterator_ops
ImportError: cannot import name 'multi_device_iterator_ops'

@reedwm
Member

reedwm commented Oct 1, 2018

Whoops, didn't realize you were on TensorFlow 1.10. Try the num_steps_issue_1.10 branch instead.

@addisokw

addisokw commented Oct 1, 2018

Whoops, didn't realize you were on TensorFlow 1.10. Try the num_steps_issue_1.10 branch instead.

Here's the output from that new branch, with the additional debug information!

Generating model
W1001 18:05:28.013666 139758063507200 tf_logging.py:125] From /root/workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1778: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-10-01 18:05:28.457261: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX512F
2018-10-01 18:05:28.463277: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1519] Found device 0 with properties: 
name: Device 6863
AMDGPU ISA: gfx900
memoryClockRate (GHz) 1.6
pciBusID 0000:19:00.0
Total memory: 15.98GiB
Free memory: 15.73GiB
2018-10-01 18:05:28.463296: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1630] Adding visible gpu devices: 0
2018-10-01 18:05:28.463312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1039] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-01 18:05:28.463316: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045]      0 
2018-10-01 18:05:28.463321: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1058] 0:   N 
2018-10-01 18:05:28.463363: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15306 MB memory) -> physical GPU (device: 0, name: Device 6863, pci bus id: 0000:19:00.0)
I1001 18:05:33.237761 139758063507200 tf_logging.py:115] Running local_init_op.
I1001 18:05:38.235556 139758063507200 tf_logging.py:115] Done running local_init_op.
Running warm up
ENCOUNTERED FOLLOWING EXCEPTION:
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1278, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1263, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
	 [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[?,1], [?,224,224,3]], output_types=[DT_INT32, DT_FLOAT], _device="/device:CPU:0"](IteratorFromStringHandle)]]
	 [[Node: tower_0/v/FunctionBufferingResourceGetNext = FunctionBufferingResourceGetNext[_class=["loc:@tower...d/Switch_1"], output_types=[DT_FLOAT, DT_INT32], _device="/job:localhost/replica:0/task:0/device:GPU:0"](input_processing/FunctionBufferingResource)]]
	 [[Node: average_loss/Mean/_591 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_4546_average_loss/Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1881, in _benchmark_graph
    collective_graph_key=collective_graph_key)
  File "/root/workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 694, in benchmark_one_step
    results = sess.run(fetches, options=run_options, run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 877, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1100, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1272, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1291, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
	 [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[?,1], [?,224,224,3]], output_types=[DT_INT32, DT_FLOAT], _device="/device:CPU:0"](IteratorFromStringHandle)]]
	 [[Node: tower_0/v/FunctionBufferingResourceGetNext = FunctionBufferingResourceGetNext[_class=["loc:@tower...d/Switch_1"], output_types=[DT_FLOAT, DT_INT32], _device="/job:localhost/replica:0/task:0/device:GPU:0"](input_processing/FunctionBufferingResource)]]
	 [[Node: average_loss/Mean/_591 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_4546_average_loss/Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Traceback (most recent call last):
  File "./tf_cnn_benchmarks.py", line 60, in <module>
    app.run(main)  # Raises error on invalid flags, unlike tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "./tf_cnn_benchmarks.py", line 56, in main
    bench.run()
  File "/root/workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1480, in run
    return self._benchmark_train()
  File "/root/workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1603, in _benchmark_train
    return self._benchmark_graph(result_to_benchmark)
  File "/root/workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1937, in _benchmark_graph
    'num_steps': num_steps,
UnboundLocalError: local variable 'num_steps' referenced before assignment
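The "End of sequence" OutOfRangeError above simply means the input pipeline ran out of records. One quick sanity check on a data directory is to count its train-* shards; this is a hypothetical helper (not part of tf_cnn_benchmarks) that assumes the conventional train-XXXXX-of-YYYYY shard naming:

```python
import glob
import os
import tempfile

def count_train_shards(data_dir):
    """Count TFRecord shards named like train-00000-of-01024 in data_dir."""
    return len(glob.glob(os.path.join(data_dir, "train-*")))

# Demo on an empty temporary directory: zero shards means the input
# pipeline would run out of records immediately.
print(count_train_shards(tempfile.mkdtemp()))  # 0
```

If this returns 0 for your --data_dir, the pipeline has nothing to read and will hit "End of sequence" during warm up.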

@renganxu

renganxu commented Oct 8, 2018

I also had this issue before, and it turns out the reason was that the permissions on the TFRecords folder were not set correctly: I didn't have the right to read that folder. But Python didn't give the correct information to point out the exact failure reason.
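A pre-flight check for this permission failure mode can be done before launching a run; check_data_dir below is a hypothetical helper, not part of tf_cnn_benchmarks:

```python
import os

def check_data_dir(data_dir):
    """Pre-flight check: is the TFRecords directory present and readable?"""
    if not os.path.isdir(data_dir):
        return "not a directory"
    if not os.access(data_dir, os.R_OK | os.X_OK):
        return "no read permission"
    return "ok"

# The current working directory is readable by the running user:
print(check_data_dir(os.getcwd()))  # ok
```

Running this against your --data_dir, as the same user that runs the benchmark, would have surfaced the permission problem directly.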

tensorflow-copybara pushed a commit that referenced this issue Oct 18, 2018
Before, if an OutOfRangeError was thrown, the Supervisor would silently ignore it, causing the cryptic error: "UnboundLocalError: local variable 'num_steps' referenced before assignment". See issues #150 and #197. This change adds a try-catch to catch the OutOfRangeError, wrapping it with a RuntimeError, so that Supervisor does not ignore it.

Also, everything under the managed_session context manager has been moved to a new function, to avoid deep nesting.

PiperOrigin-RevId: 217622216
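The fix described in the commit message above can be sketched in plain Python with stand-in names (OutOfRangeError here is a local class, not tf.errors.OutOfRangeError): the step function wraps the swallowed error in a RuntimeError so it can no longer be silently ignored.

```python
class OutOfRangeError(Exception):
    """Stand-in for tf.errors.OutOfRangeError."""

def run_one_step():
    raise OutOfRangeError("End of sequence")  # dataset exhausted

def benchmark_one_step_like():
    try:
        return run_one_step()
    except OutOfRangeError as e:
        # Wrap the error so a Supervisor-style manager, which swallows
        # OutOfRangeError, cannot hide it from the user.
        raise RuntimeError("Dataset exhausted during warm up or "
                           "benchmarking; check --data_dir") from e

try:
    benchmark_one_step_like()
except RuntimeError as e:
    print(type(e.__cause__).__name__)  # OutOfRangeError
```

With the wrapping in place, the user sees the real cause instead of the downstream num_steps UnboundLocalError.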
@rreece rreece closed this as completed Sep 13, 2020