how to train it on my own dataset #3

Open · derek-zr opened this issue Apr 11, 2018 · 48 comments

@derek-zr

Hi! I want to train Cascade R-CNN on my own dataset (three classes), but I don't know how to modify the files (e.g. those under examples/voc/). Can you give me some instructions? Thank you!

@makefile

Hi, training the models without FPN, such as res50-12s-600-rfcn-cascade, on my own dataset works fine. But when I try to train res50-15s-800-fpn-cascade on my own dataset, I hit a problem where decode_bbox_layer cannot get any valid bbox: after the code that "screen[s] out high IoU boxes, to remove redundant gt boxes", valid_bbox_ids is 0.
So, what might the problem be? Thanks. @zhaoweicai

@zhaoweicai
Owner

@makefile If you don't want to remove the redundant gt boxes, you can simply set gt_iou_thr=1.0 or higher. But a more important problem is that you might not have enough proposals: in your error case there are only gt boxes and no negative boxes. You can try lowering the proposal threshold in the "BoxGroupOutput" layer to get more proposals. Alternatively, your training may be diverging and crashing, in which case you can also try a lower learning rate.
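For anyone hitting this, here is a minimal numpy sketch of the thresholding idea being described above (purely illustrative, not the actual BoxGroupOutput C++ code; the variable names are made up):

```python
import numpy as np

def filter_proposals(boxes, fg_scores, fg_thr):
    """Keep only proposals whose foreground score clears fg_thr."""
    keep = fg_scores >= fg_thr
    return boxes[keep], fg_scores[keep]

# Early in training, foreground scores are often tiny, so a high threshold can
# leave no proposals at all; the later decode/sampling layers then see only
# ground-truth boxes and no negatives, which matches the reported check failure.
boxes = np.random.rand(1000, 4)
scores = np.random.rand(1000) * 0.05
print(filter_proposals(boxes, scores, 0.5)[0].shape)    # (0, 4): nothing survives
print(filter_proposals(boxes, scores, 0.01)[0].shape)   # far more proposals survive
```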

@makefile

@zhaoweicai Thanks! Following your advice and lowering the fg_thr in the BoxGroupOutput layer, the problem disappeared.

@Peng-wei-Yu

@zhaoweicai @makefile I tried to train Cascade R-CNN on my own dataset and got this problem. I tried lowering the iou_thr in the "BoxGroupOutput" layer but the problem is still there; can you give me any suggestions?
[screenshot of the error attached]

@jwnsu

jwnsu commented Jun 3, 2018

The error seems related to multiple GPUs. With a single GPU, training proceeds (though not for every GPU id: GPU id 1 is fine, but GPU id 2 hits the same error); with 2 GPUs, I encountered the same error as above.

@makefile

makefile commented Jun 3, 2018

@Peng-wei-Yu Try lowering the fg_thr score instead of the NMS threshold.

@Peng-wei-Yu

@jwnsu @makefile Thank you for your help. I tried lowering fg_thr and using only GPU 1, but the problem is still there. Have you tried changing the --weights in train_detection? I've decided to change the caffemodel and give that a try.

@jwnsu

jwnsu commented Jun 3, 2018

FYI, the COCO model seems to work fine (e.g. coco/res50-15s-800-fpn-cascade is fine; res101 runs out of GPU memory on a 1080 Ti). I suggest you switch from the VOC flavor to the COCO one.

@zhaoweicai
Owner

@Peng-wei-Yu when you change the number of GPUs, you should change the learning rate at the same time, as described in the paper.
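For reference, a sketch of the usual linear scaling rule, assuming the effective batch size grows with the number of GPUs; the concrete numbers below are only an example, so check the solver prototxt for the real base values:

```python
def scaled_lr(base_lr, base_gpus, new_gpus):
    """Linear scaling: keep the learning rate proportional to the effective batch size."""
    return base_lr * float(new_gpus) / base_gpus

# Hypothetical example: a config tuned for 4 GPUs with base_lr = 0.002, run on 2 GPUs.
print(scaled_lr(0.002, base_gpus=4, new_gpus=2))  # 0.001
```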

@zhaoweicai
Owner

@jwnsu The code should have no problem with multi-GPU training or the VOC dataset. Try running the script a couple of times to see if the problem still happens. If it is still there, try lowering the learning rate a little. If that still doesn't fix it, maybe something else is wrong.

@Peng-wei-Yu

@makefile @zhaoweicai When you trained Cascade R-CNN on your own data, which caffemodel did you use: your own caffemodel or ResNet-50-model-merge.caffemodel? The images in my own data are 1600*1200; should I change the short_size and long_size in train.prototxt?

@makefile

makefile commented Jun 4, 2018

@Peng-wei-Yu If you use the author's prototxt, you should use the corresponding ResNet-50-model-merge.caffemodel, since it merges the BN layers into the scale layers to reduce memory and speed things up. You can increase the input image size if you have enough memory, but the results may not improve much.
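Roughly speaking, "merging BN into the scale layer" folds the frozen BatchNorm statistics into the per-channel affine parameters, so inference only needs one multiply-add per channel. A small numpy sketch of the arithmetic (not the actual merge script):

```python
import numpy as np

def fold_bn_into_scale(mean, var, gamma, beta, eps=1e-5):
    """y = gamma * (x - mean) / sqrt(var + eps) + beta  ==  scale * x + bias"""
    std = np.sqrt(var + eps)
    scale = gamma / std
    bias = beta - gamma * mean / std
    return scale, bias

# Per-channel example with made-up statistics.
print(fold_bn_into_scale(np.array([0.1, -0.2]), np.array([1.5, 0.8]),
                         np.array([1.0, 0.5]), np.array([0.0, 0.1])))
```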

@Peng-wei-Yu

@makefile Thank you very much. I'll give ResNet-50-model-merge.caffemodel a try.

@GuoxingYan

GuoxingYan commented Jun 7, 2018

@makefile @Peng-wei-Yu In the BoxGroupOutput layer the original setting is 0.001; what value did you finally set it to?

@GuoxingYan

@makefile @Peng-wei-Yu
When I was training, the batch size was 1. There is at least one object in each of my training pictures, but why is the total number of positives equal to 0 in many iterations during training, and why is my RPN loss 0? Have you encountered such a problem?
[screenshot of the training log attached]

@makefile

makefile commented Jun 8, 2018

@GuoxingYan I set fg_thr: 0.01 or 0 in all BoxGroupOutput layers. If your number of positive RoIs is always 0, maybe your dataset has a problem.

@GuoxingYan

@makefile Did you try to change the short_size and long_size in train.prototxt? When I changed only the short_size or the long_size, there was an error.

@makefile

@GuoxingYan I did not try to change that. Since a Deconvolution layer is used to upsample, the size may need to be a multiple of 32, 64, or larger.
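If you do change the input size, a small helper like this (with an assumed stride of 32) keeps the dimensions compatible with the upsampling path:

```python
def round_up_to_multiple(size, multiple=32):
    """Round an image dimension up to the nearest multiple of the network stride."""
    return ((size + multiple - 1) // multiple) * multiple

print(round_up_to_multiple(600))   # 608
print(round_up_to_multiple(1200))  # 1216
```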

@GuoxingYan

@makefile thank you very much!!

@GuoxingYan

@makefile Did you run into the following problem when training FPN?
[screenshot of the error attached]

@makefile

@GuoxingYan I didn't meet that. The integer seems abnormally big.

@licy5152

@Peng-wei-Yu @zhaoweicai My own data size is 960*1280. I tried using ResNet-50-model-merge.caffemodel, but I also get this problem.
[screenshot of the error attached]

@GuoxingYan

@makefile @zhaoweicai @Peng-wei-Yu When I was training, I found that the short_size in detection_data_param in train.prototxt is 800, which is exactly equal to img_width and img_height in proposal_target_param. So the question arises: when I change short_size to 320, do img_width and img_height need to be changed to 320 as well?

@makefile

@GuoxingYan I think they need to be.

@licy5152

@makefile I used it to train my own dataset; how can I get the output for every picture?

@makefile

makefile commented Jul 2, 2018

@licy5152 I wrote a Python script, CascadeRCNN-demo.py, imitating the MATLAB code; you can modify it for your own use.

@GuoxingYan

@makefile Your demo.py link shows as invalid.

@makefile

makefile commented Jul 2, 2018

@GuoxingYan That's probably a problem with your network connection.

@PacteraKun

@makefile @zhaoweicai
When I was training on my own dataset, the following issue happened. However, I have already checked that there is no box with xmin = 1664 and xmax = 636 in the window_file.txt, and I also could not find the bbox_util.cpp file under the workspace directory. Could you help me solve this issue? Thanks a lot.
[screenshot of the error attached]

@makefile

makefile commented Jul 9, 2018

@PacteraKun The situation you encountered is unusual; check your data carefully.
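A rough way to sanity-check the window file before training is sketched below. The layout assumed here (per image: "# index", image path, channels, height, width, number of windows, then one "label x1 y1 x2 y2" line per window, possibly with extra columns) is only a guess, so adjust it to match whatever script generated your window_file.txt:

```python
def check_window_file(path):
    with open(path) as f:
        lines = [l.strip() for l in f if l.strip()]
    i = 0
    while i < len(lines):
        assert lines[i].startswith('#'), 'expected "# image_index" at line %d' % (i + 1)
        img_path = lines[i + 1]
        height, width = int(lines[i + 3]), int(lines[i + 4])
        num_windows = int(lines[i + 5])
        i += 6
        for _ in range(num_windows):
            fields = lines[i].split()
            label = int(fields[0])
            x1, y1, x2, y2 = map(float, fields[-4:])  # assume coords are the last 4 columns
            if label <= 0:
                print('%s: label %d should be > 0 (0 is background)' % (img_path, label))
            if not (0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height):
                print('%s: bad box (%g, %g, %g, %g) for a %dx%d image'
                      % (img_path, x1, y1, x2, y2, width, height))
            i += 1

check_window_file('window_file_train.txt')  # hypothetical file name
```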

@PacteraKun

@makefile
Have you used cascade-rcnn to train on your own dataset successfully?

@makefile

makefile commented Jul 9, 2018

@PacteraKun I trained several models, but failed to visualize the demo results. Later I ported the approach to a framework I am more familiar with.

@lzh19961031

@makefile May I ask what the labelmap_file in your test Python script is?

@DetectionIIT

@GuoxingYan @zhaoweicai
I meet the same error; have you solved it? Many of the train net outputs are equal to -1, and the model cannot be saved:

I0806 23:44:24.048591 20123 solver.cpp:219] Iteration 9900 (2.14913 iter/s, 46.5305s/100 iters), loss = 0.440841
I0806 23:44:24.048627 20123 solver.cpp:238] Train net output #0: bbox_iou = -1
I0806 23:44:24.048635 20123 solver.cpp:238] Train net output #1: bbox_iou_2nd = -1
I0806 23:44:24.048638 20123 solver.cpp:238] Train net output #2: bbox_iou_3rd = -1
I0806 23:44:24.048641 20123 solver.cpp:238] Train net output #3: bbox_iou_pre = -1
I0806 23:44:24.048645 20123 solver.cpp:238] Train net output #4: bbox_iou_pre_2nd = -1
I0806 23:44:24.048648 20123 solver.cpp:238] Train net output #5: bbox_iou_pre_3rd = -1
I0806 23:44:24.048651 20123 solver.cpp:238] Train net output #6: cls_accuracy = 0.984375
I0806 23:44:24.048655 20123 solver.cpp:238] Train net output #7: cls_accuracy_2nd = 0.972656
I0806 23:44:24.048658 20123 solver.cpp:238] Train net output #8: cls_accuracy_3rd = 0.964844
I0806 23:44:24.048666 20123 solver.cpp:238] Train net output #9: loss_bbox = 0.0117847 (* 1 = 0.0117847 loss)
I0806 23:44:24.048671 20123 solver.cpp:238] Train net output #10: loss_bbox_2nd = 0.0129223 (* 0.5 = 0.00646114 loss)
I0806 23:44:24.048676 20123 solver.cpp:238] Train net output #11: loss_bbox_3rd = 0.00699362 (* 0.25 = 0.0017484 loss)
I0806 23:44:24.048681 20123 solver.cpp:238] Train net output #12: loss_cls = 0.0294972 (* 1 = 0.0294972 loss)
I0806 23:44:24.048686 20123 solver.cpp:238] Train net output #13: loss_cls_2nd = 0.0663875 (* 0.5 = 0.0331937 loss)
I0806 23:44:24.048689 20123 solver.cpp:238] Train net output #14: loss_cls_3rd = 0.0622066 (* 0.25 = 0.0155517 loss)
I0806 23:44:24.048696 20123 solver.cpp:238] Train net output #15: rpn_accuracy = 0.999953
I0806 23:44:24.048701 20123 solver.cpp:238] Train net output #16: rpn_accuracy = -1
I0806 23:44:24.048703 20123 solver.cpp:238] Train net output #17: rpn_bboxiou = -1
I0806 23:44:24.048708 20123 solver.cpp:238] Train net output #18: rpn_loss = 0.000343773 (* 1 = 0.000343773 loss)
I0806 23:44:24.048713 20123 solver.cpp:238] Train net output #19: rpn_loss = 0 (* 1 = 0 loss)
I0806 23:44:24.048717 20123 sgd_solver.cpp:105] Iteration 9900, lr = 0.0002
I0806 23:45:10.848093 20123 solver.cpp:587] Snapshotting to binary proto file /disk1/g201708021059/cascade-rcnn/examples/voc/res101-9s-600-rfcn-cascade/log/cascadercnn_voc_iter_10000.caffemodel
*** Aborted at 1533570310 (unix time) try "date -d @1533570310" if you are using GNU date ***
PC: @ 0x7f55674532e7 caffe::Layer<>::ToProto()
*** SIGSEGV (@0x0) received by PID 20123 (TID 0x7f55682b49c0) from PID 0; stack trace: ***
@ 0x7f5565dedcb0 (unknown)
@ 0x7f55674532e7 caffe::Layer<>::ToProto()
@ 0x7f55675d7533 caffe::Net<>::ToProto()
@ 0x7f55675f415f caffe::Solver<>::SnapshotToBinaryProto()
@ 0x7f55675f42f2 caffe::Solver<>::Snapshot()
@ 0x7f55675f7f7a caffe::Solver<>::Step()
@ 0x7f55675f8994 caffe::Solver<>::Solve()
@ 0x40d4c0 train()
@ 0x408d32 main
@ 0x7f5565dd8f45 (unknown)
@ 0x409442 (unknown)
@ 0x0 (unknown)

@GuoxingYan

@Emmra https://blog.csdn.net/e01528/article/details/80913443 I hope this helps; if it does, please give it a like.

@GuoxingYan

@Emmra I did not run into the problem of being unable to save the caffemodel.

@GuoxingYan

@lzh19961031 Can you open that detection Python script? I have tried several times here without success; if it's convenient, could you send me a copy?
921905071@qq.com

@huinsysu

@makefile @GuoxingYan @licy5152 May I ask what the long_size and short_size in train.prototxt do? Some images in my dataset are 6000 by 4000; do I need to set these two parameters to 6000 and 4000? Thanks!

@huinsysu

@makefile @zhaoweicai Hi, I tried lowering the fg_thr in the BoxGroupOutput layer, but I still get the keep_num > 0 (0 vs. 0) check failure. Could I just set fg_thr to 0 in all BoxGroupOutput layers? Thanks for helping me!

@makefile

@huinsysu Those sizes control the input resizing. 6000x4000 may be too large to fit on one GPU.
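For intuition, here is a common Faster R-CNN-style resize rule (the exact behavior of the detection data layer may differ): a 6000x4000 image gets scaled down rather than processed at full resolution. The short_size=800 comes from this thread; long_size=1280 is just a placeholder value:

```python
def resized_shape(height, width, short_size=800, long_size=1280):
    """Scale so the short side reaches short_size, capped so the long side stays <= long_size."""
    scale = float(short_size) / min(height, width)
    if round(scale * max(height, width)) > long_size:
        scale = float(long_size) / max(height, width)
    return scale, int(round(height * scale)), int(round(width * scale))

print(resized_shape(4000, 6000))  # (0.2, 800, 1200)
```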

@huinsysu

@licy5152 Hi, when I trained the model on my own dataset, I met the same error as you did. Would you please tell me how you solved this problem? Thanks!

@lininglouis

@zhaoweicai May I know the intuition of using fg_thr (or when the cls_score is 0.99 or higher) to filter the bboxes? It seems that you drop all those bboxes (they don't even reach the nms_by_cls_score or proposal stage). So why drop a bbox whose cls_score is higher than 0.99 by default?

@elgong

elgong commented Oct 10, 2018

The network now trains normally, but every time I terminate the program with Ctrl+C, a root process named "irq/132-nvidia" appears with 100% CPU usage and zero memory usage. Re-running training then hangs right at the start, and nvidia-smi hangs as well:

5242 root -51 0 0 0 0 R 100.0 0.0 29:08.18 irq/132-nvidia

The only fix is to reboot. Have you ever run into this situation?

@hu5tao

hu5tao commented Dec 3, 2018

@makefile In the end, are you satisfied with the results on your datasets? I am preparing to train on my own dataset.

@makefile

makefile commented Dec 3, 2018

@hu5tao not bad.

@qianfangjj

@licy5152 I wrote a python script CascadeRCNN-demo.py imitate the matlab code, you can modify it to use.

@makefile Hello, I tried several times but could not open the CascadeRCNN-demo.py link. Would it be convenient for you to send me a copy? 422246019@qq.com Thanks!

@foralliance

@lininglouis
0.99?? I think it just drops bboxes whose cls_score is lower than 0.01.

@leizhu1989

When I train on my own data, I get an error but I don't know why. Could you give me some ideas? Thanks a lot.

I0604 13:28:15.270220 87804 detection_data_layer.cpp:142] num: 0 /home/zhulei/data/VOCdevkit/VOC2007/JPEGImages/IMG_0_112.jpg 3 1080 1920 windows to process: 36, RONI windows: 0
F0604 13:28:15.274016 87804 detection_data_layer.cpp:123] Check failed: label > 0 (0 vs. 0)
*** Check failure stack trace: ***
@ 0x7fb962af05cd google::LogMessage::Fail()
@ 0x7fb962af2433 google::LogMessage::SendToLog()
@ 0x7fb962af015b google::LogMessage::Flush()
@ 0x7fb962af2e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7fb963271781 caffe::DetectionDataLayer<>::DataLayerSetUp()
@ 0x7fb9631c27d5 caffe::BasePrefetchingDataLayer<>::LayerSetUp()
@ 0x7fb96338b6a2 caffe::Net<>::Init()
@ 0x7fb96338dd0e caffe::Net<>::Net()
@ 0x7fb963312515 caffe::Solver<>::InitTrainNet()
@ 0x7fb963312aa4 caffe::Solver<>::Init()
@ 0x7fb963312d8f caffe::Solver<>::Solver()
@ 0x7fb963335701 caffe::Creator_SGDSolver<>()
@ 0x40d912 train()
@ 0x408795 main
@ 0x7fb961335830 __libc_start_main
@ 0x4090a9 _start
@ (nil) (unknown)
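The "Check failed: label > 0" message suggests that some window in the file carries class label 0, which the data layer appears to reserve for background. A minimal hypothetical sketch of mapping category names to labels that start at 1 when generating the window file (the class names here are made up):

```python
CLASSES = ('car', 'person', 'bicycle')                           # replace with your own classes
LABEL_MAP = {name: idx + 1 for idx, name in enumerate(CLASSES)}  # 0 stays reserved for background

def to_label(name):
    return LABEL_MAP[name]  # never emits 0 for a foreground box

print(to_label('person'))  # 2
```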
