tf_cnn_benchmarks.py does not support --data_dir with my imagenet1k tfrecords #197
Comments
@rreece I used it with --data_name=imagenet --data_dir=/home/IMAGENETDATA/tfrecrd/train/ and it works.
Thanks for the follow-up! First, I should say that I'm running on a custom Dell C4140 server that has no GPU or other accelerator. It has 2 x 16 cores (Xeon Gold 6130). When I run it as you say above, it hangs on "Running warm up", just as I mentioned before.
I added some options to select:
The warm up is 10 steps of 640 images. I have no idea how fast trivial is on CPU, as that is not a test I run. The benchmark code is not set up for CPU by default, and it limits the number of CPU threads. See the perf guide linked below and try the following command. A thread count of 0 means let the system pick, which is the number of logical threads, probably 64 in your case:
python tf_cnn_benchmarks.py --device=cpu \
--nodistortions --model=resnet50 --data_format=NHWC --batch_size=1 \
--num_inter_threads=0 --num_intra_threads=0 \
--data_dir=<path to ImageNet TFRecords>
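For reference, "0 means let the system pick" follows the common convention of falling back to the logical CPU count. A minimal sketch of that convention in plain Python; `resolve_thread_count` is a hypothetical helper for illustration, not part of tf_cnn_benchmarks:

```python
import os

def resolve_thread_count(requested: int) -> int:
    """Apply the '0 means auto' convention: a request of 0 falls back
    to the number of logical CPUs the OS reports (hypothetical helper,
    not part of tf_cnn_benchmarks)."""
    if requested < 0:
        raise ValueError("thread count must be >= 0")
    # os.cpu_count() can return None on exotic platforms; default to 1.
    return requested if requested > 0 else (os.cpu_count() or 1)
```

On a 2 x 16-core Xeon Gold 6130 with hyper-threading, `os.cpu_count()` reports 64 logical CPUs, so a request of 0 would resolve to 64 threads.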
You could also set the warmup to be shorter (search the args for something like num_warm_steps or similar); I do that on CPU because I hate waiting. Just don't use zero; I doubt we handle that, but we might. Set it to 1 or something.
If you have not already, see the perf guide for some expected CPU training times, with ResNet50 as an example: about 6.2 images/second with, I think, 36 physical cores. If you use MKL you may want to try different settings; see the perf guide.
Hi, thanks a lot for the feedback. In all the variations I've tried, it still hangs in the same place, at https://github.com/tensorflow/benchmarks/blob/master/scripts/tf_cnn_benchmarks/benchmark_cnn.py#L663. I have some confidence in my dataset because I am able to run the training in https://github.com/tensorflow/models/tree/master/official/resnet. Are you saying the warmup may take 10 x 640 / 6.2 = 1032 secs = 17.2 minutes? For me, htop shows basically no CPU utilization.
I've let it run for 45 minutes now, and it is still hung, so I do not think it is a matter of not waiting for the warm up to finish.
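The back-of-the-envelope warmup estimate above depends on whether "10 steps of 640 images" means 640 images in total (10 steps at the default batch size of 64) or 640 images per step. A small sketch of both readings, using the roughly 6.2 images/second CPU figure quoted from the perf guide:

```python
def warmup_seconds(total_images: int, images_per_sec: float) -> float:
    """Estimated wall-clock time for the warm-up phase."""
    return total_images / images_per_sec

# Reading 1: 640 images in total (10 steps at batch size 64)
short_est = warmup_seconds(640, 6.2)        # roughly 103 seconds
# Reading 2: 640 images per step, as in the comment above
long_est = warmup_seconds(10 * 640, 6.2)    # roughly 1032 seconds, ~17 minutes
```

Either way, htop showing essentially no CPU utilization points to a genuine hang rather than a slow warm up.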
It's difficult to debug this because we don't have the images. Can you try running with
If those don't work, I'm not sure how to debug this.
Sorry for the delay. I confirm with @guiramirez,
I switched back to the HEAD of master today. But the following does not work (with a different error):
Again, switching back to commit
To use the newer versions of benchmarks, you would need a newer version of TensorFlow. #202
Sorry it took me so long to get back to this. I tried the head of benchmarks today with tensorflow 1.9.0, and it worked! Thanks for the feedback. Closing this issue.
However, I failed with TensorFlow 1.10. How can I solve it?
result:
@AnberLu, this is similar to #150. I cannot reproduce. What commit of tf_cnn_benchmarks are you using? Are you using Python 2 or 3? And, are you using the full imagenet dataset, or custom data?
@reedwm I use Python 2, TensorFlow 1.10, and the full ImageNet dataset.
I'm getting the same num_steps error with Python 3.6.3, tf-nightly-gpu, and the latest benchmark code. |
I'm having the same issue with Python 3, TF 1.10 (from the AMD ROCm docker image), and the full, proper ImageNet dataset. I can train a resnet50 network fine with my tf_record, but the benchmark script fails with the same num_steps error referenced above.
This is a very strange error, because
@AnberLu gave almost everything, but didn't give me the tf_cnn_benchmarks commit. |
@addisokw, can you try on this fork/branch: https://github.com/reedwm/benchmarks/tree/num_steps_issue? I forked the repo and added some extra debugging output, which will help me determine the issue. After cloning, make sure you run
When trying the num_steps_issue branch from above, I get an immediate error:
Whoops, didn't realize you were on TensorFlow 1.10. Try the num_steps_issue_1.10 branch instead. |
Here's the output from that new branch, with the additional debug information!
I also had this issue before, and it turns out the reason was that the permissions on the TFRecords folder were not set correctly: I didn't have the right to read that folder. But Python didn't give the correct information to point out the exact failure reason.
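A quick way to rule out this failure mode is to check permissions on the data directory before launching the benchmark. A minimal sketch using only the standard library; `check_tfrecord_dir` is a hypothetical helper, not part of tf_cnn_benchmarks:

```python
import os

def check_tfrecord_dir(data_dir: str) -> list:
    """Return a list of human-readable problems with a TFRecord data
    directory, so a permissions issue surfaces up front instead of as
    a silent hang (hypothetical helper, standard library only)."""
    if not os.path.isdir(data_dir):
        return [f"{data_dir} is not a directory"]
    if not os.access(data_dir, os.R_OK | os.X_OK):
        return [f"cannot read or list {data_dir}"]
    problems = []
    names = os.listdir(data_dir)
    if not names:
        problems.append(f"{data_dir} contains no files")
    unreadable = [n for n in names
                  if not os.access(os.path.join(data_dir, n), os.R_OK)]
    if unreadable:
        problems.append(f"{len(unreadable)} unreadable file(s), "
                        f"e.g. {unreadable[0]}")
    return problems
```

Running such a check against the --data_dir path before the benchmark would report a missing read permission directly instead of hanging.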
Before, if an OutOfRangeError was thrown, the Supervisor would silently ignore it, causing the cryptic error: "UnboundLocalError: local variable 'num_steps' referenced before assignment". See issues #150 and #197. This change adds a try-catch to catch the OutOfRangeError, wrapping it with a RuntimeError, so that Supervisor does not ignore it. Also, everything under the managed_session context manager has been moved to a new function, to avoid deep nesting. PiperOrigin-RevId: 217622216
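The pattern described in that change, catching the input pipeline's end-of-data error and re-raising it as a RuntimeError so a supervising loop cannot swallow it silently, can be sketched in plain Python. This uses a stand-in exception class, since tf.errors.OutOfRangeError requires TensorFlow, and the function names are illustrative:

```python
class OutOfRangeError(Exception):
    """Stand-in for tf.errors.OutOfRangeError, which the Supervisor
    used to ignore silently."""

def run_steps(step_fn, num_steps: int) -> None:
    """Run num_steps training steps, converting an unexpected
    end-of-data error into a loud RuntimeError."""
    try:
        for _ in range(num_steps):
            step_fn()
    except OutOfRangeError as e:
        # Wrap rather than swallow: a supervisor that ignores
        # OutOfRangeError will not ignore RuntimeError, so the real
        # failure (e.g. a bad --data_dir) surfaces to the user.
        raise RuntimeError(
            "input pipeline ran out of data; check --data_dir"
        ) from e
```

Chaining with `raise ... from e` preserves the original OutOfRangeError as `__cause__`, so the traceback still shows where the pipeline actually failed.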
I'm using the HEAD of both tensorflow and benchmarks. I can run tf_cnn_benchmarks.py with synthetic data like this:
But when I try to specify my own local data_dir of tfrecords for imagenet1k, it hangs sometime after printing "Running warm up":
and then it hangs.
Any ideas how I can debug using my own local dataset?
I noticed these seemingly related closed issues: #150 and #176, but they do not seem to be hanging at the same place tf_cnn_benchmarks.py does for me. Thanks!