evaluating the trained model performance based on ckpt file #19

Open
qiaomai89 opened this issue Jan 11, 2019 · 14 comments

@qiaomai89

Hi,

Thanks for sharing your code; it helps a lot.

When we run train.py, three ckpt files are saved. How can we use these model files to test performance?

I have tried two ways. **The first is to use test.py,** but it fails with:

Traceback (most recent call last):
  File "test.py", line 28, in <module>
    saver = tf.train.Saver()
  File "/home/suhuiqiao/anaconda3/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1293, in __init__
    self.build()
  File "/home/suhuiqiao/anaconda3/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1302, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/home/suhuiqiao/anaconda3/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1327, in _build
    raise ValueError("No variables to save")
ValueError: No variables to save

The second way I tried was running convert_weight.py --ckpt_file file --freeze, but it fails with:
Traceback (most recent call last):
  File "/home/suhuiqiao/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1361, in _do_call
    return fn(*args)
  File "/home/suhuiqiao/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
    target_list, status, run_metadata)
  File "/home/suhuiqiao/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [255] rhs shape= [33]
[[Node: save/Assign_349 = Assign[T=DT_FLOAT, _class=["loc:@yolov3/yolo-v3/Conv_6/biases"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](yolov3/yolo-v3/Conv_6/biases, save/RestoreV2/_149)]]

Could you help me with this problem? Thanks a lot!
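
Editor's note: the shapes in the InvalidArgumentError are consistent with a class-count mismatch. In YOLOv3, each detection head predicts 4 box coordinates, 1 objectness score, and one score per class for every anchor, so with 3 anchors per scale the final convolution has 3 × (5 + num_classes) biases. Below is a minimal sketch of that arithmetic; the function names are made up for illustration and do not come from the repository.

```python
def head_channels(num_classes, anchors_per_scale=3):
    """Output channels of a YOLOv3 detection head: anchors * (4 box + 1 objectness + classes)."""
    return anchors_per_scale * (5 + num_classes)


def classes_from_channels(channels, anchors_per_scale=3):
    """Invert head_channels; raise if the channel count does not fit the formula."""
    per_anchor, remainder = divmod(channels, anchors_per_scale)
    if remainder or per_anchor < 5:
        raise ValueError("channel count does not match anchors * (5 + classes)")
    return per_anchor - 5


# lhs is the variable in the freshly built graph, rhs is the value stored in the checkpoint.
print("graph classes:", classes_from_channels(255))
print("checkpoint classes:", classes_from_channels(33))
```

Under this reading, the graph was built for 80 classes (the COCO default) while the checkpoint holds a 6-class model. That is an inference from the two shapes, not something confirmed in the thread; if it holds, building the graph with the same class count used for training should make the shapes agree.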

@qiaomai89
Author

@YunYang1994 Hey, could you give some advice on these issues?

@forwardwfg

Have you solved it? I also ran into the same problem.

@forwardwfg

Add _ = tf.Variable(initial_value='fake_variable') before saver = tf.train.Saver(). It works in my code. You can try it.

@qiaomai89
Author

@WeifaGan Thanks a lot for sharing!

I added _ = tf.Variable(initial_value='fake_variable') before saver = tf.train.Saver(), but now there is an error:

NotFoundError (see above for traceback): Key Variable not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_STRING], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
[[Node: save/Assign/_2 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_7_save/Assign", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

Here is my question. With the previous version of train.py, training runs and saves models, but running test.py raises the error above.

With the latest version of train.py, I get:
loss_class += result[3]
IndexError: tuple index out of range
For this one, I tried the suggestion
"commented this in line 338

return object_mask, intersect_area, iou_scores"

but then this error appears:
"InvalidArgumentError (see above for traceback): ValueError: could not broadcast input array from shape (10,4) into shape (8,4)
[[Node: yolov3/PyFunc_1 = PyFuncTin=[DT_FLOAT], Tout=[DT_FLOAT], token="pyfunc_2", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]"

Could you share your train.py or give some advice?
Thanks again!

@forwardwfg

I ran into the errors you mentioned above.
For the NotFoundError, you can try the following code instead of saver.restore(sess, save_path=WEIGHTS_PATH):
    module_file = tf.train.latest_checkpoint(WEIGHTS_PATH)
    sess.run(tf.global_variables_initializer())
    if module_file is not None:
        saver.restore(sess, module_file)
It works in my code.
For the IndexError, you can comment out the line "return object_mask, intersect_area, iou_scores". In fact, you will find that the last line in yolov3.py returns what we need.
Also, what are the previous and latest versions of train.py? I only have one version.
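
Editor's note: a TF1 Saver typically writes each checkpoint as several files that share one prefix (.data-*, .index, .meta). saver.restore expects that prefix, while tf.train.latest_checkpoint expects the directory that contains the files and returns None when it finds no checkpoint there. In the None case the snippet above skips the restore, so the model keeps its freshly initialized weights. The stdlib-only sketch below uses a hypothetical helper, find_checkpoint_prefixes, to list which prefixes actually exist in a directory; the file names in the demo are made up.

```python
import os
import tempfile


def find_checkpoint_prefixes(ckpt_dir):
    """Return checkpoint prefixes found in ckpt_dir, judged by their '.index' files."""
    suffix = ".index"
    prefixes = []
    for name in sorted(os.listdir(ckpt_dir)):
        if name.endswith(suffix):
            prefixes.append(os.path.join(ckpt_dir, name[: -len(suffix)]))
    return prefixes


# Demo: create empty placeholder files named like one TF1 checkpoint.
demo_dir = tempfile.mkdtemp()
for file_suffix in (".data-00000-of-00001", ".index", ".meta"):
    open(os.path.join(demo_dir, "yolov3.ckpt-100" + file_suffix), "w").close()

prefixes = find_checkpoint_prefixes(demo_dir)
print([os.path.basename(p) for p in prefixes])
```

If the list is empty for the directory passed as WEIGHTS_PATH, it is worth checking the path before trusting any evaluation numbers, since an unrestored model would be evaluated with random weights.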

@qiaomai89
Author

@WeifaGan Hi, you are so helpful. Thanks very much.

For the IndexError, I commented out the line return object_mask, intersect_area, iou_scores,

but then this error appears:

=> loading yolov3/darknet-53/Conv_49/BatchNorm/gamma:0
=> loading yolov3/darknet-53/Conv_50/weights:0
=> loading yolov3/darknet-53/Conv_50/BatchNorm/gamma:0
=> loading yolov3/darknet-53/Conv_51/weights:0
=> loading yolov3/darknet-53/Conv_51/BatchNorm/gamma:0
=> EPOCH: 0 total_loss: nan loss_xy: 0.0066 loss_wh: nan loss_conf: 0.9094 loss_class: 0.0058 rec_50: 0.0000 rec_70: 0.0000 avg_iou: 0.0000
=> EPOCH: 1 total_loss: nan loss_xy: nan loss_wh: nan loss_conf: nan loss_class: nan rec_50: 0.0000 rec_70: 0.0000 avg_iou: nan
=> EPOCH: 2 total_loss: nan loss_xy: nan loss_wh: nan loss_conf: nan loss_class: nan rec_50: 0.0000 rec_70: 0.0000 avg_iou: nan
=> EPOCH: 3 total_loss: nan loss_xy: nan loss_wh: nan loss_conf: nan loss_class: nan rec_50: 0.0000 rec_70: 0.0000 avg_iou: nan
=> EPOCH: 4 total_loss: nan loss_xy: nan loss_wh: nan loss_conf: nan loss_class: nan rec_50: 0.0000 rec_70: 0.0000 avg_iou: nan
2019-01-16 00:54:23.044334: W tensorflow/core/framework/op_kernel.cc:1190] Invalid argument: ValueError: could not broadcast input array from shape (10,4) into shape (8,4)
Traceback (most recent call last):
  File "/home/suhuiqiao/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1361, in _do_call
    return fn(*args)
  File "/home/suhuiqiao/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
    target_list, status, run_metadata)
  File "/home/suhuiqiao/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: ValueError: could not broadcast input array from shape (10,4) into shape (8,4)
[[Node: yolov3/PyFunc_1 = PyFuncTin=[DT_FLOAT], Tout=[DT_FLOAT], token="pyfunc_2", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 73, in <module>
    run_items = sess.run([train_op, write_op] + loss, feed_dict={is_training:True})
  File "/home/suhuiqiao/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/home/suhuiqiao/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/suhuiqiao/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/home/suhuiqiao/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: ValueError: could not broadcast input array from shape (10,4) into shape (8,4)
[[Node: yolov3/PyFunc_1 = PyFuncTin=[DT_FLOAT], Tout=[DT_FLOAT], token="pyfunc_2", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Caused by op 'yolov3/PyFunc_1', defined at:
  File "train.py", line 44, in <module>
    loss = model.compute_loss(y_pred, y_true)
  File "/DALAB/DATA1/suhuiqiao/new-tensorflow-yolov3-master/core/yolov3.py", line 256, in compute_loss
    result = self.loss_layer(y_pred[i], y_true[i], _ANCHORS[i], ignore_thresh, max_box_per_image)
  File "/DALAB/DATA1/suhuiqiao/new-tensorflow-yolov3-master/core/yolov3.py", line 356, in loss_layer
    true_boxes = tf.py_func(pick_out_gt_box, [y_true], [tf.float32] )[0]
  File "/home/suhuiqiao/anaconda3/lib/python3.5/site-packages/tensorflow/python/ops/script_ops.py", line 317, in py_func
    func=func, inp=inp, Tout=Tout, stateful=stateful, eager=False, name=name)
  File "/home/suhuiqiao/anaconda3/lib/python3.5/site-packages/tensorflow/python/ops/script_ops.py", line 225, in _internal_py_func
    input=inp, token=token, Tout=Tout, name=name)
  File "/home/suhuiqiao/anaconda3/lib/python3.5/site-packages/tensorflow/python/ops/gen_script_ops.py", line 93, in _py_func
    "PyFunc", input=input, token=token, Tout=Tout, name=name)
  File "/home/suhuiqiao/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/suhuiqiao/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
    op_def=op_def)
  File "/home/suhuiqiao/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
    self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): ValueError: could not broadcast input array from shape (10,4) into shape (8,4)
[[Node: yolov3/PyFunc_1 = PyFuncTin=[DT_FLOAT], Tout=[DT_FLOAT], token="pyfunc_2", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

You can see there are two problems: the training loss becomes nan, and there is a shape error.

Do you have any advice?
Thanks!!

@forwardwfg

@qiaomai89
For the first error, I changed "boxes = tf.concat([box_centers, box_sizes], axis=-1)" to "boxes = tf.concat([box_centers - box_sizes / 2, box_centers + box_sizes / 2], axis=-1)".
If you trace the error, you will find that "pred_box_wh = pred_boxes[..., 2:4] - pred_boxes[..., 0:2]" on line 302 of yolov3.py can make pred_box_wh negative, so the log function returns nan. If the comment "# pred_boxes: the first two coordinates are the top-left corner, the last two are the bottom-right corner" were true, pred_box_wh would be positive, so pred_boxes does not actually match that comment. My change makes the comment true. The code runs, but I am not sure the change is entirely correct.
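
Editor's note: the sign problem described here can be reproduced without TensorFlow. The sketch below uses plain Python and made-up box values; the helper names are hypothetical. It shows that subtracting the first two coordinates from the last two only yields a width and height when the box is stored as corners, not as center plus size.

```python
def wh_from_corners(box):
    """Width and height of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return x_max - x_min, y_max - y_min


def center_to_corners(box):
    """Convert (center_x, center_y, width, height) to (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


center_box = (200.0, 150.0, 60.0, 40.0)  # made-up (center_x, center_y, width, height)

# Treating the center-format box as corners gives a negative "width" and "height".
bad_w, bad_h = wh_from_corners(center_box)

# After converting to corners, the same subtraction recovers the real size.
good_w, good_h = wh_from_corners(center_to_corners(center_box))

print("misread as corners:", bad_w, bad_h)
print("after conversion:", good_w, good_h)
```

A negative value passed to tf.log produces nan, and once one loss term is nan the total loss is nan as well, which matches the training log earlier in the thread. Whether the corner conversion is the right fix for every use of boxes in this repository is not established here.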

For the second issue, I changed "true_boxes_batch[i][0][0][0][0:len(true_boxes_per_layer)] = true_boxes_per_layer" at about line 370 of yolov3.py as follows:
    if len(true_boxes_per_layer) < max_box_per_image:
        true_boxes_batch[i][0][0][0][0:len(true_boxes_per_layer)] = true_boxes_per_layer
    else:
        true_boxes_batch[i][0][0][0][0:max_box_per_image] = true_boxes_per_layer[0:max_box_per_image]
The reason is that "true_boxes_batch = np.zeros([bs, 1, 1, 1, max_box_per_image, 4], dtype=np.float32)" at about line 357 limits the buffer to max_box_per_image boxes. Of course, I am also not sure this is entirely correct, but it is mostly correct.
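
Editor's note: the broadcast error and the truncation fix can be reproduced in isolation with NumPy. This is a minimal sketch with made-up data: a buffer for 8 boxes and 10 ground-truth boxes, matching the shapes in the error message. It flattens the buffer to two dimensions for brevity, so it is not the repository's exact code.

```python
import numpy as np

max_box_per_image = 8                                         # capacity of the padded buffer
true_boxes_per_layer = np.ones((10, 4), dtype=np.float32)     # 10 made-up ground-truth boxes
padded = np.zeros((max_box_per_image, 4), dtype=np.float32)

# Original assignment: the slice is clipped to 8 rows, but 10 rows are supplied.
try:
    padded[0:len(true_boxes_per_layer)] = true_boxes_per_layer
except ValueError as err:
    error_message = str(err)
    print(error_message)

# Fix: copy at most max_box_per_image boxes into the buffer.
n = min(len(true_boxes_per_layer), max_box_per_image)
padded[:n] = true_boxes_per_layer[:n]
print(padded.shape, n)
```

Note that truncating silently drops every box beyond the limit. For images that often contain more objects than max_box_per_image, raising that limit may be a better choice than discarding labels, though the thread does not test this.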

After these modifications, I think you can run it.

@qiaomai89
Author

@WeifaGan You are really so nice, kind, patient, and helpful! I really appreciate your help! Now I can run both training and testing.

By the way, how does your model perform? Is it equal to the one trained with the code provided by the author (darknet) on the same dataset?

@forwardwfg

@qiaomai89
Hi, you are welcome. I only trained for a short time yesterday, and the result was not satisfactory. I will train it more over the next few days. I hope we can communicate more.

@qiaomai89
Author

@WeifaGan Hi, I am training with both frameworks (darknet and tf) on the same dataset. If I get any results, I will let you know. If you find anything, please share it with me. Thanks!

@qiaomai89
Author

@WeifaGan Could you run testing on the ckpt files?

I have tried two ways. First, I ran convert_weight.py and got three pb files. Then I ran nms_demo.py to view the jpg results, but nothing is drawn on the picture. Here is the log:
=> nms on gpu the number of boxes= 0 time=5239.83 ms
=> nms on gpu the number of boxes= 0 time=42.90 ms
=> nms on gpu the number of boxes= 0 time=37.91 ms
=> nms on gpu the number of boxes= 0 time=36.61 ms
=> nms on gpu the number of boxes= 0 time=40.40 ms

The other way I tried was running test.py on the tfrecords data. Here is the log:
=> EPOCH: 0 rec:0.00 prec:0.00 mAP:0.00
=> EPOCH: 1 rec:0.00 prec:0.00 mAP:0.00
=> EPOCH: 2 rec:0.00 prec:0.00 mAP:0.00
=> EPOCH: 3 rec:0.00 prec:0.00 mAP:0.00
=> EPOCH: 4 rec:0.00 prec:0.00 mAP:0.00
=> EPOCH: 5 rec:0.00 prec:0.00 mAP:0.00
=> EPOCH: 6 rec:0.00 prec:0.00 mAP:0.00
=> EPOCH: 7 rec:0.00 prec:0.00 mAP:0.00
=> EPOCH: 8 rec:0.00 prec:0.00 mAP:0.00
=> EPOCH: 9 rec:0.00 prec:0.00 mAP:0.00
=> EPOCH: 10 rec:0.00 prec:0.00 mAP:0.00
=> EPOCH: 11 rec:0.00 prec:0.00 mAP:0.00
=> EPOCH: 12 rec:0.00 prec:0.00 mAP:0.00
=> EPOCH: 13 rec:0.00 prec:0.00 mAP:0.00
=> EPOCH: 14 rec:0.00 prec:0.00 mAP:0.00
=> EPOCH: 15 rec:0.00 prec:0.00 mAP:0.00
=> EPOCH: 16 rec:0.00 prec:0.00 mAP:0.00
=> EPOCH: 17 rec:0.00 prec:0.00 mAP:0.00
=> EPOCH: 18 rec:0.00 prec:0.00 mAP:0.00
=> EPOCH: 19 rec:0.00 prec:0.00 mAP:0.00

Any advice about this one?

@forwardwfg

@qiaomai89
Try lowering score_thresh in the evaluate function in utils.py; you will see a mAP that is nonzero but small. I think there are some problems with the code. I ran train.py, but the total loss keeps fluctuating between about 0.1 and 2 and does not converge. So I tried another implementation, https://github.com/aloyschen/tensorflow-yolo3. At least it tends to converge. I am training it now, and I recommend that you try it as well.
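
Editor's note: the effect of score_thresh can be illustrated generically. The sketch below is not the evaluate function from utils.py, which is not shown in this thread; filter_by_score and the detection values are made up. It only shows why an undertrained model with low confidences reports zero boxes at a high threshold and some boxes at a lower one.

```python
def filter_by_score(detections, score_thresh):
    """Keep detections whose confidence is at least score_thresh."""
    return [d for d in detections if d["score"] >= score_thresh]


# Made-up detections from an undertrained model: every confidence is low.
detections = [
    {"box": (10, 20, 50, 80), "score": 0.12},
    {"box": (30, 40, 90, 120), "score": 0.04},
    {"box": (5, 5, 25, 25), "score": 0.27},
]

for thresh in (0.3, 0.1, 0.01):
    kept = filter_by_score(detections, thresh)
    print("score_thresh =", thresh, "boxes kept =", len(kept))
```

If lowering the threshold still yields zero boxes, as reported later in the thread, the scores themselves are probably near zero, which points back to training or weight loading rather than to the threshold.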

To make communication easier, I think we can add each other on QQ or WeChat.

@qiaomai89
Author

@WeifaGan You can add me on WeChat: 455741772

@stevenwuaggie507

@WeifaGan Hi, I followed your suggested code changes and also lowered score_thresh to 0.1, but I still get the same result as @qiaomai89:

l GPU (device: 0, name: TITAN V, pci bus id: 0000:65:00.0, compute capability: 7.0)
=> nms on gpu the number of boxes= 0 time=4340.55 ms
=> nms on gpu the number of boxes= 0 time=28.67 ms
=> nms on gpu the number of boxes= 0 time=27.60 ms
=> nms on gpu the number of boxes= 0 time=28.12 ms
=> nms on gpu the number of boxes= 0 time=29.70 ms

Have you solved this issue?
Thanks!!
