how to train it on my own dataset #3

Open · derek-zr opened this issue Apr 11, 2018 · 48 comments

@derek-zr

Hi! I want to train Cascade R-CNN on my own dataset (three classes), but I don't know how to modify the files (e.g. those under examples/voc/). Can you give me some instructions? Thank you!

@makefile

Hi, training the models without FPN, such as res50-12s-600-rfcn-cascade, on my own dataset works fine. But when I try to train res50-15s-800-fpn-cascade on my own dataset, I hit a problem where decode_bbox_layer cannot get any valid bbox: after the code that "screen[s] out high IoU boxes, to remove redundant gt boxes", valid_bbox_ids is 0.
So, what might the problem be? Thanks. @zhaoweicai

@zhaoweicai
Owner

@makefile If you don't want to remove the redundant gt boxes, you can simply set gt_iou_thr=1.0 or higher. But a more important problem is that you might not have enough proposals: in your error case there are only gt boxes and no negative boxes. You can try lowering the proposal threshold in the "BoxGroupOutput" layer to get more proposals. Alternatively, your training may be diverging and crashing, in which case you can also try a lower learning rate.
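For anyone hitting this, here is a minimal numpy sketch of the thresholding idea being described above (purely illustrative, not the actual BoxGroupOutput C++ code; the variable names are made up):

```python
import numpy as np

def filter_proposals(boxes, fg_scores, fg_thr):
    """Keep only proposals whose foreground score clears fg_thr."""
    keep = fg_scores >= fg_thr
    return boxes[keep], fg_scores[keep]

# Early in training, foreground scores are often tiny, so a high threshold can
# leave no proposals at all; the later decode/sampling layers then see only
# ground-truth boxes and no negatives, which matches the reported check failure.
boxes = np.random.rand(1000, 4)
scores = np.random.rand(1000) * 0.05
print(filter_proposals(boxes, scores, 0.5)[0].shape)    # (0, 4): nothing survives
print(filter_proposals(boxes, scores, 0.01)[0].shape)   # far more proposals survive
```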

@makefile

@zhaoweicai Thanks! Following your advice and lowering the fg_thr in the BoxGroupOutput layer, the problem disappeared.

@Peng-wei-Yu

@zhaoweicai @makefile I tried to train Cascade R-CNN on my own dataset and got this problem. I tried lowering the iou_thr in the "BoxGroupOutput" layer but the problem is still there; can you give me any suggestions?
[screenshot of the error attached]

@jwnsu

jwnsu commented Jun 3, 2018

The error seems related to multiple GPUs. With a single GPU, training proceeds (though not for every GPU id: GPU id 1 is fine, but GPU id 2 hits the same error); with 2 GPUs, I encountered the same error as above.

@makefile

makefile commented Jun 3, 2018

@Peng-wei-Yu Try lowering the fg_thr score instead of the NMS threshold.

@Peng-wei-Yu

@jwnsu @makefile Thank you for your help. I tried lowering fg_thr and using only GPU 1, but the problem is still there. Have you tried changing the --weights in train_detection? I've decided to change the caffemodel and give that a try.

@jwnsu

jwnsu commented Jun 3, 2018

FYI, the COCO model seems to work fine (e.g. coco/res50-15s-800-fpn-cascade is fine; res101 runs out of GPU memory on a 1080 Ti). I suggest you switch from the VOC flavor to the COCO one.

@zhaoweicai
Owner

@Peng-wei-Yu when you change the number of GPUs, you should change the learning rate at the same time, as described in the paper.
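For reference, a sketch of the usual linear scaling rule, assuming the effective batch size grows with the number of GPUs; the concrete numbers below are only an example, so check the solver prototxt for the real base values:

```python
def scaled_lr(base_lr, base_gpus, new_gpus):
    """Linear scaling: keep the learning rate proportional to the effective batch size."""
    return base_lr * float(new_gpus) / base_gpus

# Hypothetical example: a config tuned for 4 GPUs with base_lr = 0.002, run on 2 GPUs.
print(scaled_lr(0.002, base_gpus=4, new_gpus=2))  # 0.001
```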

@zhaoweicai
Owner

@jwnsu The code should have no problem with multi-GPU training or the VOC dataset. Try running the script a couple of times to see if the problem still happens. If it is still there, try lowering the learning rate a little. If that still doesn't fix it, maybe something else is wrong.

@Peng-wei-Yu

@makefile @zhaoweicai When you trained Cascade R-CNN on your own data, which caffemodel did you use: your own caffemodel or ResNet-50-model-merge.caffemodel? The images in my own data are 1600*1200; should I change the short_size and long_size in train.prototxt?

@makefile

makefile commented Jun 4, 2018

@Peng-wei-Yu If you use the author's prototxt, you should use the corresponding ResNet-50-model-merge.caffemodel, since it merges the BN layers into the scale layers to reduce memory and speed things up. You can increase the input image size if you have enough memory, but the results may not improve much.
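Roughly speaking, "merging BN into the scale layer" folds the frozen BatchNorm statistics into the per-channel affine parameters, so inference only needs one multiply-add per channel. A small numpy sketch of the arithmetic (not the actual merge script):

```python
import numpy as np

def fold_bn_into_scale(mean, var, gamma, beta, eps=1e-5):
    """y = gamma * (x - mean) / sqrt(var + eps) + beta  ==  scale * x + bias"""
    std = np.sqrt(var + eps)
    scale = gamma / std
    bias = beta - gamma * mean / std
    return scale, bias

# Per-channel example with made-up statistics.
print(fold_bn_into_scale(np.array([0.1, -0.2]), np.array([1.5, 0.8]),
                         np.array([1.0, 0.5]), np.array([0.0, 0.1])))
```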

@Peng-wei-Yu

@makefile Thank you very much. I'll give ResNet-50-model-merge.caffemodel a try.

@GuoxingYan

GuoxingYan commented Jun 7, 2018

@makefile @Peng-wei-Yu In the BoxGroupOutput layer the original setting is 0.001; what value did you finally set it to?

@GuoxingYan

@makefile @Peng-wei-Yu
When I was training, the batch size was 1. There is at least one object in each of my training pictures, but why is the total number of positives equal to 0 in many iterations during training, and why is my RPN loss 0? Have you encountered such a problem?
[screenshot of the training log attached]

@makefile

makefile commented Jun 8, 2018

@GuoxingYan I set fg_thr: 0.01 or 0 in all BoxGroupOutput layers. If your number of positive RoIs is always 0, maybe your dataset has a problem.

@GuoxingYan

@makefile Did you try to change the short_size and long_size in train.prototxt? When I changed only the short_size or the long_size, there was an error.

@makefile

@GuoxingYan I did not try to change that. Since a Deconvolution layer is used to upsample, the size may need to be a multiple of 32, 64, or larger.
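If you do change the input size, a small helper like this (with an assumed stride of 32) keeps the dimensions compatible with the upsampling path:

```python
def round_up_to_multiple(size, multiple=32):
    """Round an image dimension up to the nearest multiple of the network stride."""
    return ((size + multiple - 1) // multiple) * multiple

print(round_up_to_multiple(600))   # 608
print(round_up_to_multiple(1200))  # 1216
```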

@GuoxingYan

@makefile thank you very much!!

@GuoxingYan

@makefile Did you run into the following problem when training FPN?
[screenshot of the error attached]

@makefile

@GuoxingYan I didn't meet that. The integer seems abnormally big.

@licy5152

@Peng-wei-Yu @zhaoweicai My own data size is 960*1280. I tried using ResNet-50-model-merge.caffemodel, but I also get this problem.
[screenshot of the error attached]

@GuoxingYan

@makefile @zhaoweicai @Peng-wei-Yu When I was training, I found that the short_size in detection_data_param in train.prototxt is 800, which is exactly equal to img_width and img_height in proposal_target_param. So the question arises: when I change short_size to 320, do img_width and img_height need to be changed to 320 as well?

@makefile

@GuoxingYan I think they need to be.

@licy5152

@makefile I used it to train my own dataset; how can I get the output for every picture?

@makefile

makefile commented Jul 2, 2018

@licy5152 I wrote a Python script, CascadeRCNN-demo.py, imitating the MATLAB code; you can modify it for your own use.

@GuoxingYan

@makefile Your demo.py link shows as invalid.

@makefile

makefile commented Jul 2, 2018

@GuoxingYan That's probably a problem with your network connection.

@PacteraKun

@makefile @zhaoweicai
When I was training on my own dataset, the following issue happened. However, I have already checked that there is no box with xmin = 1664 and xmax = 636 in the window_file.txt, and I also could not find the bbox_util.cpp file under the workspace directory. Could you help me solve this issue? Thanks a lot.
[screenshot of the error attached]

@makefile

makefile commented Jul 9, 2018

@PacteraKun The situation you encountered is unusual; check your data carefully.
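A rough way to sanity-check the window file before training is sketched below. The layout assumed here (per image: "# index", image path, channels, height, width, number of windows, then one "label x1 y1 x2 y2" line per window, possibly with extra columns) is only a guess, so adjust it to match whatever script generated your window_file.txt:

```python
def check_window_file(path):
    with open(path) as f:
        lines = [l.strip() for l in f if l.strip()]
    i = 0
    while i < len(lines):
        assert lines[i].startswith('#'), 'expected "# image_index" at line %d' % (i + 1)
        img_path = lines[i + 1]
        height, width = int(lines[i + 3]), int(lines[i + 4])
        num_windows = int(lines[i + 5])
        i += 6
        for _ in range(num_windows):
            fields = lines[i].split()
            label = int(fields[0])
            x1, y1, x2, y2 = map(float, fields[-4:])  # assume coords are the last 4 columns
            if label <= 0:
                print('%s: label %d should be > 0 (0 is background)' % (img_path, label))
            if not (0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height):
                print('%s: bad box (%g, %g, %g, %g) for a %dx%d image'
                      % (img_path, x1, y1, x2, y2, width, height))
            i += 1

check_window_file('window_file_train.txt')  # hypothetical file name
```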

@PacteraKun

@makefile
Have you used cascade-rcnn to train on your own dataset successfully?

@makefile

makefile commented Jul 9, 2018

@PacteraKun I trained several models, but failed to visualize the demo results. Later I ported the approach to a framework I am more familiar with.

@lzh19961031

@makefile May I ask what the labelmap_file in your test Python script is?

@DetectionIIT

@GuoxingYan @zhaoweicai
I meet the same error; have you solved it? Many of the train net outputs are equal to -1, and the model cannot be saved:

I0806 23:44:24.048591 20123 solver.cpp:219] Iteration 9900 (2.14913 iter/s, 46.5305s/100 iters), loss = 0.440841
I0806 23:44:24.048627 20123 solver.cpp:238] Train net output #0: bbox_iou = -1
I0806 23:44:24.048635 20123 solver.cpp:238] Train net output #1: bbox_iou_2nd = -1
I0806 23:44:24.048638 20123 solver.cpp:238] Train net output #2: bbox_iou_3rd = -1
I0806 23:44:24.048641 20123 solver.cpp:238] Train net output #3: bbox_iou_pre = -1
I0806 23:44:24.048645 20123 solver.cpp:238] Train net output #4: bbox_iou_pre_2nd = -1
I0806 23:44:24.048648 20123 solver.cpp:238] Train net output #5: bbox_iou_pre_3rd = -1
I0806 23:44:24.048651 20123 solver.cpp:238] Train net output #6: cls_accuracy = 0.984375
I0806 23:44:24.048655 20123 solver.cpp:238] Train net output #7: cls_accuracy_2nd = 0.972656
I0806 23:44:24.048658 20123 solver.cpp:238] Train net output #8: cls_accuracy_3rd = 0.964844
I0806 23:44:24.048666 20123 solver.cpp:238] Train net output #9: loss_bbox = 0.0117847 (* 1 = 0.0117847 loss)
I0806 23:44:24.048671 20123 solver.cpp:238] Train net output #10: loss_bbox_2nd = 0.0129223 (* 0.5 = 0.00646114 loss)
I0806 23:44:24.048676 20123 solver.cpp:238] Train net output #11: loss_bbox_3rd = 0.00699362 (* 0.25 = 0.0017484 loss)
I0806 23:44:24.048681 20123 solver.cpp:238] Train net output #12: loss_cls = 0.0294972 (* 1 = 0.0294972 loss)
I0806 23:44:24.048686 20123 solver.cpp:238] Train net output #13: loss_cls_2nd = 0.0663875 (* 0.5 = 0.0331937 loss)
I0806 23:44:24.048689 20123 solver.cpp:238] Train net output #14: loss_cls_3rd = 0.0622066 (* 0.25 = 0.0155517 loss)
I0806 23:44:24.048696 20123 solver.cpp:238] Train net output #15: rpn_accuracy = 0.999953
I0806 23:44:24.048701 20123 solver.cpp:238] Train net output #16: rpn_accuracy = -1
I0806 23:44:24.048703 20123 solver.cpp:238] Train net output #17: rpn_bboxiou = -1
I0806 23:44:24.048708 20123 solver.cpp:238] Train net output #18: rpn_loss = 0.000343773 (* 1 = 0.000343773 loss)
I0806 23:44:24.048713 20123 solver.cpp:238] Train net output #19: rpn_loss = 0 (* 1 = 0 loss)
I0806 23:44:24.048717 20123 sgd_solver.cpp:105] Iteration 9900, lr = 0.0002
I0806 23:45:10.848093 20123 solver.cpp:587] Snapshotting to binary proto file /disk1/g201708021059/cascade-rcnn/examples/voc/res101-9s-600-rfcn-cascade/log/cascadercnn_voc_iter_10000.caffemodel
*** Aborted at 1533570310 (unix time) try "date -d @1533570310" if you are using GNU date ***
PC: @ 0x7f55674532e7 caffe::Layer<>::ToProto()
*** SIGSEGV (@0x0) received by PID 20123 (TID 0x7f55682b49c0) from PID 0; stack trace: ***
@ 0x7f5565dedcb0 (unknown)
@ 0x7f55674532e7 caffe::Layer<>::ToProto()
@ 0x7f55675d7533 caffe::Net<>::ToProto()
@ 0x7f55675f415f caffe::Solver<>::SnapshotToBinaryProto()
@ 0x7f55675f42f2 caffe::Solver<>::Snapshot()
@ 0x7f55675f7f7a caffe::Solver<>::Step()
@ 0x7f55675f8994 caffe::Solver<>::Solve()
@ 0x40d4c0 train()
@ 0x408d32 main
@ 0x7f5565dd8f45 (unknown)
@ 0x409442 (unknown)
@ 0x0 (unknown)

@GuoxingYan

@Emmra https://blog.csdn.net/e01528/article/details/80913443 I hope this helps; if it does, please give it a like.

@GuoxingYan

@Emmra I did not run into the problem of being unable to save the caffemodel.

@GuoxingYan

@lzh19961031 Can you open that detection Python script? I have tried several times here without success; if it's convenient, could you send me a copy?
921905071@qq.com

@huinsysu

@makefile @GuoxingYan @licy5152 May I ask what the long_size and short_size in train.prototxt do? Some images in my dataset are 6000 by 4000; do I need to set these two parameters to 6000 and 4000? Thanks!

@huinsysu

@makefile @zhaoweicai Hi, I tried lowering the fg_thr in the BoxGroupOutput layer, but I still get the keep_num > 0 (0 vs. 0) check failure. Could I just set fg_thr to 0 in all BoxGroupOutput layers? Thanks for helping me!

@makefile

@huinsysu Those sizes control the input resizing. 6000x4000 may be too large to fit on one GPU.
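For intuition, here is a common Faster R-CNN-style resize rule (the exact behavior of the detection data layer may differ): a 6000x4000 image gets scaled down rather than processed at full resolution. The short_size=800 comes from this thread; long_size=1280 is just a placeholder value:

```python
def resized_shape(height, width, short_size=800, long_size=1280):
    """Scale so the short side reaches short_size, capped so the long side stays <= long_size."""
    scale = float(short_size) / min(height, width)
    if round(scale * max(height, width)) > long_size:
        scale = float(long_size) / max(height, width)
    return scale, int(round(height * scale)), int(round(width * scale))

print(resized_shape(4000, 6000))  # (0.2, 800, 1200)
```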

@huinsysu

@licy5152 Hi, when I trained the model on my own dataset, I met the same error as you did. Would you please tell me how you solved this problem? Thanks!

@lininglouis

@zhaoweicai May I know the intuition of using fg_thr (or when the cls_score is 0.99 or higher) to filter the bboxes? It seems that you drop all those bboxes (they don't even reach the nms_by_cls_score or proposal stage). So why drop a bbox whose cls_score is higher than 0.99 by default?

@elgong

elgong commented Oct 10, 2018

The network now trains normally, but every time I terminate the program with Ctrl+C, a root process named "irq/132-nvidia" appears with 100% CPU usage and zero memory usage. Re-running training then hangs right at the start, and nvidia-smi hangs as well:

5242 root -51 0 0 0 0 R 100.0 0.0 29:08.18 irq/132-nvidia

The only fix is to reboot. Have you ever run into this situation?

@hu5tao

hu5tao commented Dec 3, 2018

@makefile In the end, are you satisfied with the results on your datasets? I am preparing to train on my own dataset.

@makefile

makefile commented Dec 3, 2018

@hu5tao not bad.

@qianfangjj

@licy5152 I wrote a python script CascadeRCNN-demo.py imitate the matlab code, you can modify it to use.

@makefile Hello, I tried several times but could not open the CascadeRCNN-demo.py link. Would it be convenient for you to send me a copy? 422246019@qq.com Thanks!

@foralliance

@lininglouis
0.99?? I think it just drops bboxes whose cls_score is lower than 0.01.

@leizhu1989

When I train on my own data, I get an error but I don't know why. Could you give me some ideas? Thanks a lot.

I0604 13:28:15.270220 87804 detection_data_layer.cpp:142] num: 0 /home/zhulei/data/VOCdevkit/VOC2007/JPEGImages/IMG_0_112.jpg 3 1080 1920 windows to process: 36, RONI windows: 0
F0604 13:28:15.274016 87804 detection_data_layer.cpp:123] Check failed: label > 0 (0 vs. 0)
*** Check failure stack trace: ***
@ 0x7fb962af05cd google::LogMessage::Fail()
@ 0x7fb962af2433 google::LogMessage::SendToLog()
@ 0x7fb962af015b google::LogMessage::Flush()
@ 0x7fb962af2e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7fb963271781 caffe::DetectionDataLayer<>::DataLayerSetUp()
@ 0x7fb9631c27d5 caffe::BasePrefetchingDataLayer<>::LayerSetUp()
@ 0x7fb96338b6a2 caffe::Net<>::Init()
@ 0x7fb96338dd0e caffe::Net<>::Net()
@ 0x7fb963312515 caffe::Solver<>::InitTrainNet()
@ 0x7fb963312aa4 caffe::Solver<>::Init()
@ 0x7fb963312d8f caffe::Solver<>::Solver()
@ 0x7fb963335701 caffe::Creator_SGDSolver<>()
@ 0x40d912 train()
@ 0x408795 main
@ 0x7fb961335830 __libc_start_main
@ 0x4090a9 _start
@ (nil) (unknown)
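The "Check failed: label > 0" message suggests that some window in the file carries class label 0, which the data layer appears to reserve for background. A minimal hypothetical sketch of mapping category names to labels that start at 1 when generating the window file (the class names here are made up):

```python
CLASSES = ('car', 'person', 'bicycle')                           # replace with your own classes
LABEL_MAP = {name: idx + 1 for idx, name in enumerate(CLASSES)}  # 0 stays reserved for background

def to_label(name):
    return LABEL_MAP[name]  # never emits 0 for a foreground box

print(to_label('person'))  # 2
```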
