Evaluation in Object Detection hanging #2225

Open
RobinBaumann opened this issue Aug 16, 2017 · 42 comments

@RobinBaumann
commented Aug 16, 2017

System information

  • What is the top-level directory of the model you are using: tensorflow/models/object_detection
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes (well, I actually just adjusted the pipeline config to fit my dataset)
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 64bit
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 1.2.1
  • Bazel version (if compiling from source):
  • CUDA/cuDNN version: 8.0 /5.1
  • GPU model and memory: GeForce GTX1060 6GB
  • Exact command to reproduce: python object_detection\eval.py --logtostderr --pipeline_config_path=C:\Users\robin\PycharmProjects\test\my_net\models\faster_r-cnn_resnet101\pipeline.config --checkpoint_dir=C:\Users\robin\PycharmProjects\test\my_net\models\faster_r-cnn_resnet101\train\ --eval_dir=C:\Users\robin\PycharmProjects\test\my_net\models\faster_r-cnn_resnet101\eval\

Describe the problem

I am able to train with the Object Detection API on my own dataset, which I created using the create_pascal_tf_record.py script (I adjusted it a bit, but mainly the paths). I also checked the generated TFRecord files with the TensorFlow testing module and verified that the reconstructed images are similar to the original ones.

I use the existing faster_r-cnn_resnet101_voc07.config file and only adjusted the paths and num_classes. Training runs like a charm, but when I start the eval.py script, it hangs with the message "INFO:tensorflow:Restoring parameters from C:\Users\robin\PycharmProjects\test\my_net\models\faster_r-cnn_resnet101\train\model.ckpt-123805" (see full log below).

After this I have to press CTRL+C.

I can see some output in TensorBoard, but only one mAP value after each CTRL+C and nothing in the other diagrams.

As others with the same issue have mentioned, running evaluation and training in parallel doesn't work for me, and I can't imagine that it should be done this way. When I try it, CUDA crashes because the GPU runs out of memory.

Btw I also tried the whole evaluation process on the Oxford-IIIT Pet Dataset and am facing the same issue.

Source code / logs

The whole log after I hit CTRL+C (the part where it hangs is bold):

C:\Users\robin\models>python object_detection\eval.py --logtostderr --pipeline_config_path=C:\Users\robin\PycharmProjects\test\my_net\models\faster_r-cnn_resnet101\pipeline.config --checkpoint_dir=C:\Users\robin\PycharmProjects\test\my_net\models\faster_r-cnn_resnet101\train\ --eval_dir=C:\Users\robin\PycharmProjects\test\my_net\models\faster_r-cnn_resnet101\eval
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
2017-08-16 07:40:03.000943: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 07:40:03.001072: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 07:40:03.001933: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 07:40:03.002044: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 07:40:03.002153: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 07:40:03.002263: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 07:40:03.002357: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 07:40:03.002451: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 07:40:03.313527: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:940] Found device 0 with properties:
name: GeForce GTX 1060 6GB
major: 6 minor: 1 memoryClockRate (GHz) 1.7085
pciBusID 0000:01:00.0
Total memory: 6.00GiB
Free memory: 5.01GiB
2017-08-16 07:40:03.313690: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:961] DMA: 0
2017-08-16 07:40:03.314894: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0: Y
2017-08-16 07:40:03.314995: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0)
INFO:tensorflow:Restoring parameters from C:\Users\robin\PycharmProjects\test\my_net\models\faster_r-cnn_resnet101\train\model.ckpt-123805
INFO:tensorflow:Restoring parameters from C:\Users\robin\PycharmProjects\test\my_net\models\faster_r-cnn_resnet101\train\model.ckpt-123805

Traceback (most recent call last):
File "object_detection\eval.py", line 161, in <module>
tf.app.run()
File "C:\Users\robin\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "object_detection\eval.py", line 157, in main
FLAGS.checkpoint_dir, FLAGS.eval_dir)
File "C:\Users\robin\models\object_detection\evaluator.py", line 211, in evaluate
save_graph_dir=(eval_dir if eval_config.save_graph else ''))
File "C:\Users\robin\models\object_detection\eval_util.py", line 524, in repeated_checkpoint_run
time.sleep(time_to_next_eval)
KeyboardInterrupt

@RobinBaumann RobinBaumann changed the title Evaluation in Object Detection hangs and behaves strange Evaluation in Object Detection hanging Aug 16, 2017

@skye

Member

commented Aug 16, 2017

@failedmath

commented Aug 18, 2017

I had the same issue. How many images do you have in your validation set? I assume there are just too many. Try reducing it to 10 or some other reasonable number (in the config file) and check whether it works. Do you actually see images in TensorBoard?

@RobinBaumann

Author

commented Aug 18, 2017

I have about 63 images in the validation set and can see the images in TensorBoard. But it still hangs.

@failedmath

commented Aug 18, 2017

I would say try with just a couple and see if it works. Or try to run only the eval on the GPU (this involves some magic to point it at a checkpoint other than the last one).

@ncaadam

commented Aug 23, 2017

By default, the eval script runs forever. You'll need to define max_evals in your eval proto in your train config proto.

Thus, it isn't hanging. It is just continually evaluating images, with sleeps in between. The sleep, by default, is 300 seconds.

You can set shuffle to true in your eval_input_reader if you'd like to see different images on your images tab in tensorboard.
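
As a rough sketch of what this could look like, the eval_config block below caps the run at a single evaluation pass. The field names are the ones used in this thread (num_examples, max_evals, eval_interval_secs); the values are only illustrative, with 63 taken from the validation set size reported earlier and 300 being the default interval quoted in this comment.

```
eval_config: {
  num_examples: 63
  # Stop after one evaluation instead of polling for new checkpoints forever.
  max_evals: 1
  # Seconds to wait between polls for a new checkpoint.
  eval_interval_secs: 300
}
```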

@RobinBaumann

Author

commented Aug 29, 2017

@ncaadam so there is no output during the evaluation? The script ran during my lunch break (about 45 minutes) and there still was no output. And when I look at the code in eval_util.py, I can see logging statements whose output does not get displayed in my terminal.

@failedmath

commented Aug 29, 2017

Have you tried with about 5 images? You have to change eval_config and then check them in TensorBoard. Then wait until a new checkpoint appears. eval.py first takes the latest checkpoint from training and then, as described before, checks every 300 seconds whether a new model has been saved. If you run evaluation after training has finished, it will appear to hang because it is waiting for the next checkpoint.
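
The behaviour described here can be sketched in plain Python. This is not the code from eval_util.py, only a simplified model of it: poll_and_evaluate, get_latest_checkpoint and evaluate are hypothetical names, and the sleep function is injectable so the loop can be exercised without waiting.

```python
import time


def poll_and_evaluate(get_latest_checkpoint, evaluate,
                      eval_interval_secs=300, max_evals=None,
                      sleep=time.sleep):
    """Evaluate each new checkpoint once, sleeping between polls.

    Hypothetical sketch: returns the number of evaluations performed.
    With max_evals unset and no new checkpoints, it loops forever.
    """
    last_evaluated = None
    num_evals = 0
    while True:
        checkpoint = get_latest_checkpoint()
        if checkpoint is not None and checkpoint != last_evaluated:
            evaluate(checkpoint)
            last_evaluated = checkpoint
            num_evals += 1
            if max_evals and num_evals >= max_evals:
                break
        # Nothing is printed here, which is why the script looks hung
        # once training has stopped producing checkpoints.
        sleep(eval_interval_secs)
    return num_evals
```

With training finished, get_latest_checkpoint keeps returning the same path, so the loop only sleeps; setting max_evals is what lets it return.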

@RobinBaumann

Author

commented Aug 29, 2017

Oh, I understand it now. So I have to run eval and train in parallel to get continuous evaluation. The problem I have with this is that CUDA crashes because it runs out of GPU memory. I have decreased the size of the test set from 20% to 5%, but it still crashes.

@failedmath what exactly does the num_examples parameter do? Does it randomly pick the specified number of images from test.tfrecord? Can I set this parameter to 5 and leave my test set at 20% of the total dataset size, or do I have to reduce the overall size of my test set?

@failedmath

commented Aug 29, 2017

@RobinBaumann I run eval on my CPU, so it does not crash. For 10 images the config is:

eval_config: {
num_examples: 10
num_visualizations: 10
eval_interval_secs: 120
}

The tfrecord file should then be created from just those 10 images.

@RobinBaumann

Author

commented Aug 29, 2017

Okay thank you! This approach works for me.

I think it would be useful to add one or two sentences to the documentation, or some additional logging to the evaluation script, in case someone else struggles with the same problem. I have seen some earlier issues that mentioned similar problems.

@liuqi05

commented Sep 6, 2017

@RobinBaumann Hi, I ran into the same problem as you. I downloaded five pictures and saved them in the directory /models/valimg. I use the existing faster_r-cnn_resnet101_voc07.config file and only adjusted the paths. Finally, I run python object_detection/eval.py
--logtostderr
--pipeline_config_path=./configfile/faster_rcnn_resnet101_voc07.config
--checkpoint_dir=./faster_rcnn_resnet101_coco_11_06_2017
--eval_dir=./valimg
However, I can see some output in directory valimg
"-rw-r--r-- 1 joseph joseph 211856 9月 6 10:50 events.out.tfevents.1504666240.ubuntu
-rw-r--r-- 1 joseph joseph 297676 9月 6 10:51 events.out.tfevents.1504666266.ubuntu
-rw-r--r-- 1 joseph joseph 185722 9月 6 10:51 events.out.tfevents.1504666290.ubuntu
-rw-r--r-- 1 joseph joseph 374962 9月 6 10:51 events.out.tfevents.1504666318.ubuntu
-rw-r--r-- 1 joseph joseph 233590 9月 6 10:52 events.out.tfevents.1504666342.ubuntu
-rw-r--r-- 1 joseph joseph 234716 9月 6 10:52 events.out.tfevents.1504666368.ubuntu
-rw-r--r-- 1 joseph joseph 312911 9月 6 10:53 events.out.tfevents.1504666393.ubuntu
-rw-r--r-- 1 joseph joseph 399273 9月 6 10:53 events.out.tfevents.1504666421.ubuntu
-rw-r--r-- 1 joseph joseph 373139 9月 6 10:54 events.out.tfevents.1504666447.ubuntu
-rw-r--r-- 1 joseph joseph 1633 9月 6 10:54 events.out.tfevents.1504666472.ubuntu"
but only events.out.tfevents files, so I have to press CTRL+C. There is a value for mAP but nothing else in the other diagrams etc. When I run tensorboard --logdir =./valimg, it displays nothing. Can you tell me what is wrong with my setup? Thank you in advance.

@auroua

commented Sep 6, 2017

Why doesn't the evaluation script output any information?
I found lots of logging statements in the run_checkpoint_once function.
Are there any examples of how to calculate ROC or mAP?

@RobinBaumann

Author

commented Sep 7, 2017

@liuqi05 I don't know what is wrong with your configuration. Maybe try launching Tensorboard with a full path to your eval_dir?

Also my directory structure looks like this:

+project/

  +data/ (dataset location)

  +models/

    +faster_r-cnn_resnet101/

      +eval/

      +inference/

      +train/

      +pipeline.config

And I launch TensorBoard with the following command (from the project directory):

tensorboard --logdir=./models/faster_r-cnn_resnet101/

Maybe you have to set the parent directory of your valimg/ as logdir for Tensorboard?
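
One way to check whether a given logdir contains anything TensorBoard could display is to look for event files underneath it. The helper below is a small sketch for that purpose; find_event_files is a hypothetical name, and the only assumption about the files is the events.out.tfevents. prefix seen in the listing earlier in this thread.

```python
import os


def find_event_files(logdir):
    """Return sorted paths of TensorBoard event files found under logdir."""
    matches = []
    for root, _dirs, files in os.walk(logdir):
        for name in files:
            # Event files are named events.out.tfevents.<timestamp>.<host>.
            if name.startswith("events.out.tfevents."):
                matches.append(os.path.join(root, name))
    return sorted(matches)
```

An empty result for the directory passed to --logdir suggests the path is wrong or the eval job has not written any summaries yet.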

@liu6381810

commented Sep 7, 2017

@liuqi05 I have the same problem as you:
for train, TensorBoard displays output, but for eval it displays nothing.

@cipri-tom

commented Oct 10, 2017

@failedmath

I run eval on my CPU, so it does not crash

How do you restrict it to do that? 😃

@josifovski

commented Oct 13, 2017

@cipri-tom At the beginning of the eval.py script you can set the CUDA_VISIBLE_DEVICES to empty in the following way:

import os
os.environ["CUDA_VISIBLE_DEVICES"]="" 

Then the evaluation will fall back to cpu execution.
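
A slightly fuller sketch of the same idea follows. The point it tries to make explicit is ordering: the variable should be set before TensorFlow is imported, since the visible devices are read when CUDA is initialized. hide_gpus is a hypothetical helper name, and the TensorFlow import is left commented out so the snippet runs on its own.

```python
import os


def hide_gpus():
    """Hide all CUDA devices from libraries imported later in this process."""
    os.environ["CUDA_VISIBLE_DEVICES"] = ""


hide_gpus()

# Import TensorFlow only after the variable is set, for example:
# import tensorflow as tf
```

This leaves the GPU free for the training job while evaluation runs, more slowly, on the CPU.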

@cipri-tom

commented Oct 13, 2017

@josifovski obviously! Why did I not think of that? >.< I am using CUDA_VISIBLE_DEVICES to set which GPU to use but I didn't think I could set it to null.

Thank you!

@ybsave

commented Nov 1, 2017

Same problem here. Although this can easily be fixed with one line of code, logging.basicConfig(level=logging.DEBUG), I think the TensorFlow people should fix this bug.
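
The effect can be reproduced with the standard logging module alone. The `WARNING:root:` lines in the logs in this thread suggest that the evaluation code logs through Python's root logger, whose default threshold is WARNING, so logging.info calls are dropped. The sketch below, with the hypothetical helper capture_info_messages, shows one INFO message being dropped and one being emitted after the level is lowered.

```python
import io
import logging


def capture_info_messages():
    """Return what the root logger emits before and after enabling INFO."""
    stream = io.StringIO()
    root = logging.getLogger()
    handler = logging.StreamHandler(stream)
    saved_level = root.level
    root.addHandler(handler)
    try:
        root.setLevel(logging.WARNING)       # Python's default root level
        logging.info("evaluated image 1")    # dropped: below WARNING
        root.setLevel(logging.INFO)
        logging.info("evaluated image 2")    # emitted
    finally:
        # Restore the root logger so the demo has no side effects.
        root.setLevel(saved_level)
        root.removeHandler(handler)
    return stream.getvalue()
```

Note that logging.basicConfig has no effect if the root logger already has handlers attached, in which case calling logging.getLogger().setLevel(logging.INFO) directly may be the more reliable option.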

@shreyas0906

commented Jan 21, 2018

Can someone please explain how to run the evaluation from scratch? I want to find the mAP and IoU for a dataset that I have. Detection works fine, but I have no clue how to evaluate the model. For now, all I am trying to do is
python eval.py --logtostderr --checkpoint_dir=data/model.ckpt-5575 --eval_dir=eval_model/ --pipeline_config_path=data/object-detect.pbtxt

and the object-detect.pbtxt contains

item {
id: 1
name: 'ball'
}

Can someone please let me know how to do this?

@josifovski

commented Jan 21, 2018

@shreyas0906 the pipeline_config_path argument you provide is not correct. You shouldn't provide the pbtxt label map file but the configuration file that you used for training, e.g. faster_rcnn_inception_resnet_v2_atrous_coco.config from the provided tensorflow/models/research/object_detection/samples/configs, adapted for your purposes. Inside the config file, adapt the eval_input_reader element so that it points to the TFRecord created from the images you want to evaluate.
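
To catch this mix-up before launching eval.py, a quick text check can tell the two file types apart. This is only a heuristic sketch, not part of the Object Detection API: looks_like_pipeline_config is a hypothetical helper that assumes a pipeline config contains top-level model, train_config, eval_config and eval_input_reader blocks, whereas a label map contains only item blocks.

```python
import re

# Top-level blocks assumed to be present in a full pipeline config.
PIPELINE_BLOCKS = ("model", "train_config", "eval_config", "eval_input_reader")


def looks_like_pipeline_config(text):
    """Heuristically check that text has the blocks of a pipeline config."""
    return all(
        re.search(r"^\s*" + name + r"\s*:?\s*\{", text, flags=re.MULTILINE)
        for name in PIPELINE_BLOCKS
    )
```

A label map such as the one posted above fails this check because it has none of those blocks.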

@shreyas0906

commented Jan 21, 2018

@josifovski Thank you for the information! My next question: after running eval.py, I am unable to find the mAP and IoU graphs. I can view the images and the detections on them, but not the mAP and IoU graphs. Can you please let me know how to achieve that?

Greatly appreciate your help!

@Fahimkh

commented Jan 23, 2018

I am facing this problem too. Everything works perfectly for ssd_mobilenet, even evaluation. For frcnn_resnet, evaluation stops at this line:
WARNING:root:image 0 does not have groundtruth difficult flag specified

I changed max_evals to 1 and lowered the number of samples, but it still hangs and doesn't move forward.
Can anybody help me solve this issue?

@ssk1991

commented Jan 23, 2018

@Fahimkh interesting, because eval stops after roughly 9-10 images using ssd_mobilenet for me.

@ssk1991

commented Jan 26, 2018

Looks like I've fixed it for myself.

In ssd_mobilenet_v1_pets.config

eval_config: {
num_examples: 186
num_visualizations: 186
}

add the line "num_visualizations" and set it to the number of examples you have (in my case 186).

@mohdsubhi

commented Feb 21, 2018

Hi guys, wonder if you can help me out here... what am I doing wrong?

this is a copy of my cmd prompt://

C:\Users\Sawal\Desktop\models\models-master\research\object_detection>eval.py --logtostderr --pipeline_config_path=training/ssd_mobilenet_v1_pets.config --checkpoint_dir=training --eval_dir=test
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
2018-02-21 20:53:06.664120: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
INFO:tensorflow:Restoring parameters from training\model.ckpt-2541
INFO:tensorflow:Restoring parameters from training\model.ckpt-2541
WARNING:root:image 0 does not have groundtruth difficult flag specified
WARNING:root:The following classes have no ground truth examples: [2 4]
C:\Users\Sawal\Desktop\models\models-master\research\object_detection\utils\metrics.py:144: RuntimeWarning: invalid value encountered in true_divide
num_images_correctly_detected_per_class / num_gt_imgs_per_class)

Then nothing comes out, just a continuous loop. Any help is appreciated.

@cipri-tom

commented Feb 21, 2018

@mohdsubhi nothing is supposed to come out, IIRC. The evaluation results are published in summaries, which you can see in TensorBoard

@mohdsubhi

commented Feb 21, 2018

@cipri-tom so basically I have to run the evaluation at the same time as training, right? Or doesn't it matter?

@cipri-tom

commented Feb 22, 2018

Preferably, yes, so you can get continuous evaluation and see how your model progresses.

@evolu8

commented Feb 24, 2018

@cipri-tom "The evaluation results are published in summaries, which you can see in TensorBoard"
Where? I'm not seeing anything useful in tensorboard. I see some images, and the graph. I want to see metrics. I'm running eval.py

Many thanks in advance.

@tanndx17

commented Feb 28, 2018

@cipri-tom I had the same problem. Can you tell me what the output files in the eval directory are supposed to be? I only have several events.out.tfevents.XXXX files and a pipeline.config. Should there be an eval.pbtxt file as output there?
Thank you!!

@cipri-tom

commented Feb 28, 2018

No, there are only *tfevents files. These contain the TensorBoard summaries. Fire up tensorboard --logdir /path/to/eval/dir

@priyakansal

commented May 18, 2018

Hi,
I badly need help. I made the following changes in the sample config file:
train_config:
initial_learning_rate: 0.001 (previously it was set to 0.004)
eval_config:
max_evals: 0 (previously it was set to 10)
To start with, I ran the pretrained model with exactly the same configuration. Both the training job and the evaluation job ran smoothly, but since max_evals was set to 10, the evaluation job stopped. I want it to run indefinitely so that I can get the mAP measure, so I changed it to 0 and tried to run it again. train.py runs smoothly, but eval.py does not. Here is a copy of the prompt:

/workspace/models/research/object_detection/utils/visualization_utils.py:25: UserWarning:
This call to matplotlib.use() has no effect because the backend has already
been chosen; matplotlib.use() must be called before pylab, matplotlib.pyplot,
or matplotlib.backends is imported for the first time.

The backend was originally set to 'TkAgg' by the following code:
File "eval.py", line 50, in <module>
from object_detection import evaluator
File "/workspace/models/research/object_detection/evaluator.py", line 24, in <module>
from object_detection import eval_util
File "/workspace/models/research/object_detection/eval_util.py", line 28, in <module>
from object_detection.metrics import coco_evaluation
File "/workspace/models/research/object_detection/metrics/coco_evaluation.py", line 20, in <module>
from object_detection.metrics import coco_tools
File "/workspace/models/research/object_detection/metrics/coco_tools.py", line 47, in <module>
from pycocotools import coco
File "/workspace/models/research/pycocotools/coco.py", line 49, in <module>
import matplotlib.pyplot as plt
File "/usr/local/lib/python2.7/dist-packages/matplotlib/pyplot.py", line 71, in <module>
from matplotlib.backends import pylab_setup
File "/usr/local/lib/python2.7/dist-packages/matplotlib/backends/__init__.py", line 16, in <module>
line for line in traceback.format_stack()

import matplotlib; matplotlib.use('Agg') # pylint: disable=multiple-statements
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
2018-05-18 10:31:34.033209: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-05-18 10:31:34.744555: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-05-18 10:31:34.744967: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
totalMemory: 15.78GiB freeMemory: 5.58GiB
2018-05-18 10:31:34.745000: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-05-18 10:31:35.262651: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-05-18 10:31:35.262739: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-05-18 10:31:35.262767: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-05-18 10:31:35.263065: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5338 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
INFO:tensorflow:Restoring parameters from /workspace/models/research/object_detection/my_ssd_mobilenet_v2_coco_2017/train_check_point_exp4/model.ckpt-6737
INFO:tensorflow:Restoring parameters from /workspace/models/research/object_detection/my_ssd_mobilenet_v2_coco_2017/train_check_point_exp4/model.ckpt-6737
WARNING:root:image 0 does not have groundtruth difficult flag specified

@cipri-tom

commented May 18, 2018

@priyakansal it is working, but you have warnings. Please use Stack Overflow for support and add details regarding the mAP that you see in TensorBoard in both cases. I'll try to follow up there.

@bermeitinger-b

commented Jun 18, 2018

I have a similar problem. I'm using the Faster-RCNN Inception V2 pre-trained model and trained it on my own dataset with 24 classes.
When I run the eval.py script with the correct parameters, the process will eat up all available memory (I put limits to 12GB, 48GB, and 128GB; I do not have more) and then just hang at INFO:tensorflow:Restoring parameters from /data/train/model.ckpt-200000 for a few seconds and then just quit without further notification.
The eval dir contains only the pipeline.cfg file.

My test.record only has 34 images that are relatively small, so I don't think the size should matter. Also, it doesn't matter if I allow access to 0, 1, 2, or 3 CUDA cards.

Additionally, for training, it worked after giving it access to 128GB and it used all of it. I think there is something wrong with the memory consumption.

I'm running it inside the docker container of nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04 using Python 3.5. I did not have any success with tensorflow/tensorflow:latest-gpu with either Python 2 or Python 3.

@cipri-tom

commented Jun 19, 2018

@bermeitinger-b If you run out of RAM (as opposed to GPU memory) see here . It's a problem of queues.

@bermeitinger-b

commented Jun 20, 2018

I don't think so. I reduced it to one image per batch, a queue size of one, and a prefetch size of one.
I added some print statements to check where it crashes, and it is in the coco_evaluation.py file, so after the network has been loaded and the image has already been processed.
That said, it could just be the last thing that didn't fit into memory. I have set a limit of 124GB.
I could train successfully and also run inference on many images, so I do not know why the evaluation requires so much memory.

@dragan-apostolski

commented Jun 20, 2018

So did anyone find a way to make TensorBoard show the Scalars tab with the evaluation metrics?

@Cheren15

commented Jun 22, 2018

@priyakansal I had the same problem and spent a long time looking for a solution; it turned out that I just needed a little patience.
@cipri-tom Thank you for your reminder. Now I have my evaluation results. Do you have an idea how to explain that I got a classification_loss of about 8 while localization_loss is close to 1?

@leccyril

commented Jul 8, 2018

Where are localization_loss, classification_loss, objectness, etc. explained? Could I combine the training and evaluation values in TensorBoard?

@alvinxiii

commented Aug 5, 2018

I can only see the Images and Graphs tabs. How do I check the mAP values in the Scalars tab?

@leccyril

commented Aug 5, 2018

For example, write the training output to /train and the eval output to /train/eval, then launch TensorBoard with the path /train; it will pick up both sets of event logs.

@cjr0106

commented Sep 29, 2018

Has anyone run into this problem in eval.py?
Caused by op 'save/RestoreV2', defined at:
File "eval.py", line 147, in <module>
tf.app.run()
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
_sys.exit(main(argv))
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\util\deprecation.py", line 272, in new_func
return func(*args, **kwargs)
File "eval.py", line 143, in main
graph_hook_fn=graph_rewriter_fn)
File "D:\tensorflow\tf.models\models\research\object_detection\legacy\evaluator.py", line 251, in evaluate
saver = tf.train.Saver(variables_to_restore)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 1281, in __init__
self.build()
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 1293, in build
self._build(self._filename, build_save=True, build_restore=True)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 1330, in _build
build_save=build_save, build_restore=build_restore)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 778, in _build_internal
restore_sequentially, reshape)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 397, in _AddRestoreOps
restore_sequentially)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 829, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\ops\gen_io_ops.py", line 1546, in restore_v2
shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\util\deprecation.py", line 454, in new_func
return func(*args, **kwargs)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\framework\ops.py", line 3155, in create_op
op_def=op_def)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\framework\ops.py", line 1717, in __init__
self._traceback = tf_stack.extract_stack()

NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key lr not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64, DT_FLOAT],
_device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
