
Py-Faster-Rcnn using Resnet #122

Open · hoticevijay opened this issue Mar 21, 2016 · 43 comments

@hoticevijay

Based on #62
I am trying to train on my own dataset using ResNet + py-faster-rcnn (using @siddharthm83's train.txt). I am getting the following error.

I0321 07:29:44.037149 1892 solver.cpp:60] Solver scaffolding done.
Loading pretrained model weights from data/imagenet_models/resnet.caffemodel
I0321 07:29:44.240974 1892 net.cpp:816] Ignoring source layer fc1000
I0321 07:29:44.241065 1892 net.cpp:816] Ignoring source layer prob
Solving...
F0321 07:29:45.412804 1892 syncedmem.cpp:56] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
Aborted (core dumped)

I am using an AWS instance. I was able to train ResNet-50 (without fast-rcnn) on the same instance with the same dataset, but when I try py-faster-rcnn I get this error. I know this error could be due to insufficient memory, so I changed the batch size in deploy.prototxt (iter_size: 1), but I still get the error. Can someone help me out?

@abhirevan

Experiment with different batch sizes in the yml file or in config.py.
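
For example, here is a minimal sketch of those knobs; the keys are the ones defined in py-faster-rcnn's lib/fast_rcnn/config.py, and the values are only illustrative trade-offs of accuracy for memory:

```python
# Override memory-related settings before training starts;
# the same keys can also be set in the experiment .yml file.
from fast_rcnn.config import cfg

cfg.TRAIN.IMS_PER_BATCH = 1   # images per minibatch
cfg.TRAIN.BATCH_SIZE = 64     # ROIs sampled per image (default 128)
cfg.TRAIN.SCALES = (400,)     # target shorter side of the input (default 600)
cfg.TRAIN.MAX_SIZE = 666      # cap on the longer side (default 1000)
```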

@happyharrycn

ResNet-101 on Caffe requires >10 GB of GPU memory for an input at VGA resolution (640*480) during training (even when you fix all conv1_x/conv2_x/conv3_x layers). A couple of memory optimizations can be done easily, though:

  • Always compile Caffe with cuDNN to avoid internal buffers for conv layers.
  • Merge BatchNorm + Scale into a single Scale layer; the current implementation of BatchNorm in Caffe takes far too much memory (a folding sketch follows this list). Or, more radically, you can fold conv + BatchNorm + Scale into a single conv layer, at the cost of some detection performance (since you can then no longer freeze the BatchNorm+Scale layers).
  • Use the in-place eltwise sum from the PR here.
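
A minimal sketch of the BatchNorm + Scale merge from the second bullet, assuming Caffe's blob layout for BatchNorm ([mean_sum, variance_sum, moving_average_factor]) and Scale ([gamma, beta]); the layer names and file paths are placeholders. After folding, delete the BatchNorm layers from the prototxt and keep only the (now frozen) Scale layers:

```python
import numpy as np
import caffe

def fold_bn_into_scale(net, bn_name, scale_name, eps=1e-5):
    factor = net.params[bn_name][2].data[0]
    factor = 0.0 if factor == 0 else 1.0 / factor
    mean = net.params[bn_name][0].data * factor
    var = net.params[bn_name][1].data * factor
    gamma = net.params[scale_name][0].data.copy()
    beta = net.params[scale_name][1].data.copy()
    std = np.sqrt(var + eps)
    # gamma * (x - mean) / std + beta == (gamma/std) * x + (beta - gamma*mean/std)
    net.params[scale_name][0].data[...] = gamma / std
    net.params[scale_name][1].data[...] = beta - gamma * mean / std

net = caffe.Net('ResNet-50-train_val.prototxt',
                'data/imagenet_models/ResNet-50-model.caffemodel', caffe.TEST)
fold_bn_into_scale(net, 'bn2a_branch1', 'scale2a_branch1')  # repeat per BN/Scale pair
net.save('ResNet-50-bn-folded.caffemodel')
```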

A more fundamental solution is to allow Caffe to reuse the gradient (diff) buffers across blobs. One can safely overwrite the diff of a blob once the weights of all layers consuming that blob have been updated. That is how ResNet-101 can be trained with a batch size of 32 on 12 GB of memory, as mentioned in the original paper.
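
A toy illustration of that scheduling idea (plain Python, not Caffe code): walk the layers backward, and once the consumer of a top blob's diff has run, return the buffer to a pool so a lower blob can recycle it.

```python
import numpy as np

class Blob(object):
    def __init__(self, shape):
        self.shape = shape
        self.diff = None

blobs = [Blob((4,)), Blob((4,)), Blob((4,))]  # toy chain: blob0 -> blob1 -> blob2
free_pool = []                                # retired diff buffers

def get_buffer(shape):
    """Recycle an exact-size buffer from the pool, else allocate a new one."""
    for i, buf in enumerate(free_pool):
        if buf.shape == shape:
            return free_pool.pop(i)
    return np.zeros(shape)

blobs[-1].diff = np.ones(blobs[-1].shape)     # seed the topmost gradient
for i in reversed(range(len(blobs) - 1)):
    bottom, top = blobs[i], blobs[i + 1]
    bottom.diff = get_buffer(bottom.shape)
    bottom.diff[...] = top.diff               # stand-in for layer.backward()
    # ...this layer's weight update would happen here; top.diff is then dead:
    free_pool.append(top.diff)
    top.diff = None
```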

@kl2005ad

@happyharrycn I found something interesting when training ResNet + faster rcnn on my own dataset. If I fix all batchNorm+scale layers in conv1 ~ conv4 and only allow updates on the conv layers, the resulting model is far from what the paper claims. If I allow batchNorm+scale to update, I get much better performance (close to VGG16). But faster rcnn uses only one image per batch and is not supposed to be able to update batchNorm+scale properly. What am I missing?
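
For context, "fixing" a batchNorm+scale pair is usually expressed like this in the train prototxt (a sketch only; layer and blob names follow the standard ResNet naming and may differ in your model). use_global_stats makes BatchNorm apply the stored ImageNet statistics, and the zero lr_mult/decay_mult stop all updates:

```
layer {
  name: "bn2a_branch1"
  type: "BatchNorm"
  bottom: "res2a_branch1"
  top: "res2a_branch1"
  batch_norm_param { use_global_stats: true }
  param { lr_mult: 0 }
  param { lr_mult: 0 }
  param { lr_mult: 0 }
}
layer {
  name: "scale2a_branch1"
  type: "Scale"
  bottom: "res2a_branch1"
  top: "res2a_branch1"
  scale_param { bias_term: true }
  param { lr_mult: 0 decay_mult: 0 }
  param { lr_mult: 0 decay_mult: 0 }
}
```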

@happyharrycn

@kl2005ad ResNet for detection has been reproduced at multiple sites, and I am not sure why you are getting worse performance based on your description here. The last time I tried on VOC, it worked better than claimed in the paper :) Here are some implementation details I used for training.

  1. Freeze all batchNorm+Scale layers (from conv1_x ~ conv5_x).
  2. Optional: freeze conv1_x to conv3_x to save some memory/time.
  3. Put a ROI pooling layer at the end of conv4_x, use conv5_x as the classifier (similar to the FC layers in VGG16), and attach two prediction branches (softmax with loss for classification and smooth L1 loss for box regression) after the average pooling.
  4. You might want to change the ROI pooling from 14 * 14 to 7 * 7 and increase the resolution in conv5_x (change the downsampling layers by setting their stride to 1). This is not exactly equivalent to the original paper, but it helps detect smaller objects (a prototxt sketch of this pooling setup follows).
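
A prototxt sketch of the pooling setup from points 3 and 4 (the bottom blob names are assumptions based on the standard ResNet-50 naming):

```
layer {
  name: "roi_pool"
  type: "ROIPooling"
  bottom: "res4f"          # last conv4_x output in ResNet-50 naming
  bottom: "rois"
  top: "roi_pool"
  roi_pooling_param {
    pooled_w: 7            # the 7 * 7 variant from point 4
    pooled_h: 7
    spatial_scale: 0.0625  # 1/16, the feature stride at conv4_x
  }
}
```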

@banxiaduhuo

@kl2005ad @happyharrycn Hi! I tried to train resnet50 with faster-rcnn and got a very low result on VOC2007, about 0.47, even lower than the ZF model's 0.62. What result did you get on VOC2007? I didn't freeze the batchNorm+Scale layers.

Is my solver correct?
base_lr: 0.001
lr_policy: "multistep"
gamma: 0.1
stepvalue: 300000
stepvalue: 500000
display: 20
momentum: 0.9
weight_decay: 0.0001
snapshot: 0

Thank you very much!

@happyharrycn

I was getting ~0.73-0.74 mAP on VOC07 test using ResNet-101 (trained on VOC07 trainval) with 60K iterations. Training details can be found in my previous post in this thread. From a quick look at your solver file, I think you probably have too many iterations (500K is way too much).

@banxiaduhuo

@happyharrycn Thank you~ I was getting 0.65 mAP on VOC2007 using ResNet-50 with 70k iterations, following your implementation details (fixing all batchNorm+scale layers and conv1_x ~ conv3_x). I will try ResNet-101. It seems BN must be fixed when fine-tuning, right?
And how do I merge BatchNorm + Scale and let Caffe reuse the gradients for each blob? Can you give me some guidance?

@c149028

c149028 commented Apr 11, 2016

@banxiaduhuo, one option: a BatchNorm layer acts as a Scale layer at test time, and two consecutive Scale layers can be merged into one (y = s1*x + b1 followed by z = s2*y + b2 collapses to z = (s2*s1)*x + (s2*b1 + b2)).
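
A tiny numpy check of that identity (scalar scales for brevity; per-channel vectors behave the same way):

```python
import numpy as np

x = np.random.randn(8)
s1, b1 = 1.5, 0.2    # first Scale layer
s2, b2 = 0.7, -0.1   # second Scale layer
z_two = s2 * (s1 * x + b1) + b2            # two consecutive Scale layers
z_one = (s2 * s1) * x + (s2 * b1 + b2)     # single merged Scale layer
assert np.allclose(z_two, z_one)
```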
@happyharrycn, thank you for sharing your knowledge, it's really helpful! I have a question about reusing the diff blobs. Does that mean we need to modify Caffe so that backprop (gradient computation and parameter update) proceeds layer by layer?

@ice-pice

ice-pice commented Jun 19, 2016

@happyharrycn I am trying to fine-tune resnet-50 + faster-rcnn on the COCO dataset as in Kaiming's paper, using a learning rate of 0.001 for 240K iterations and 0.0001 for the next 80K iterations (with the provided end2end training).
It appears that this number of iterations is too high, because the val AP score starts decreasing from iteration 150K onwards. Can you share some insight into how many iterations are required to train on the 80K images of the COCO dataset?
Thanks!

@happyharrycn

happyharrycn commented Jun 20, 2016

@ice-pice I think 240K + 80K is actually not enough iterations for training on COCO. I have used 500K iterations for 120K images (COCO train + val). Have you tried keeping the training running for the full 320K iterations to check whether the AP keeps decreasing after 150K?

@ice-pice

ice-pice commented Jun 27, 2016

  1. @happyharrycn I was skipping the average pooling from ResNet, which was leading to the poor results after 150K iterations mentioned above. I found my mistake when I generated the model using https://github.com/XiaozhiChen/resnet-generator. I'd presume that without average pooling there are too many parameters between the last convolutional layer and the first fully connected layer, which are difficult to calibrate by fine-tuning (see the quick arithmetic at the end of this comment).
  2. Below are my network configuration and results. Can you point out the reasons why my AP scores are below the reported ones?
  • Model : ResNet-50 + faster-rcnn

  • Train/Test set : COCO train/validation set

  • Iterations : 320K (240K with lr = 0.001 and 80K with lr = 0.0001)

  • Mini-batch : 1 image generating 256 proposals (as mentioned in the faster-rcnn paper)

  • Detection network initialization : ResNet-50 Imagenet trained weights

  • RPN network initialization : Random initialization

  • Results : AP (IoU=0.5) scores at different iterations for val set

    [image attachment: AP scores at different iterations]

Note that my 320K iterations process 320K images. In the 500K iterations you mentioned, do you process 500K images or 500*8K?

Thanks!
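
Quick arithmetic behind point 1 above, assuming a 7x7x2048 conv5 output per ROI and COCO's 81 classes (80 + background); the exact sizes depend on the head design:

```python
print(7 * 7 * 2048 * 81)  # 8,128,512 weights for a direct FC layer
print(2048 * 81)          # 165,888 weights after global average pooling
```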

@CrossLee1

@ice-pice
I also tried to train resnet50+faster rcnn on the COCO dataset.
However, the training speed is very slow, about 4s/iter, and the loss does not seem to decrease at all.

What is your training speed? Could you share your log file so I can see how the loss changes?
Thanks~

@ice-pice

ice-pice commented Jun 30, 2016

@CrossLee1
It takes around 1s/iter for me to train resnet50 + faster-rcnn on an NVIDIA Titan X. 4s/iter seems like too much; it could be happening because you are adding unnecessary additional parameters to the architecture. You can validate your prototxt by comparing it with one generated by https://github.com/XiaozhiChen/resnet-generator.

From my observation, the change in loss is not indicative of convergence in this case because the mini-batch size is 1. After 50K iterations or so, the loss fluctuates within the same interval until 320K; smoothing the logged loss gives a clearer picture (see the sketch below).
Link to the log for the model I trained in my previous comment: https://drive.google.com/file/d/0B4AOlDvVIP8RMUxQS0M5dWJEYjg/view?usp=sharing
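
A minimal sketch for judging convergence from a noisy log like this one: pull the per-iteration loss out of the Caffe log and smooth it with a moving average (the log path, regex, and window size are placeholders to adapt):

```python
import re
import numpy as np

losses = []
with open('train.log') as f:
    for line in f:
        # matches solver lines like "Iteration 100, loss = 1.234"
        m = re.search(r'Iteration \d+, loss = ([\d.]+)', line)
        if m:
            losses.append(float(m.group(1)))

window = 500
smoothed = np.convolve(losses, np.ones(window) / window, mode='valid')
print(smoothed[::1000])  # coarse view of the smoothed trend
```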

I am still making changes, and if I reach the baseline I can share the prototxt with you if you'd like.
Cheers!

@CrossLee1

@ice-pice
Thanks for your reply~
I will use a prototxt generated by https://github.com/XiaozhiChen/resnet-generator and try again.

Wishing you good results~

@CrossLee1

@ice-pice
I wonder whether you have succeeded in training resnet50 + py-faster-rcnn and reached the baseline?
Hope to get your results~

@cervantes-loves-ai

How do I solve this?
WARNING: Logging before InitGoogleLogging() is written to STDERR
W0718 17:01:06.813407 26562 _caffe.cpp:122] DEPRECATION WARNING - deprecated use of Python interface
W0718 17:01:06.813459 26562 _caffe.cpp:123] Use this instead (with the named "weights" parameter):
W0718 17:01:06.813477 26562 _caffe.cpp:125] Net('/home/rvlab/Documents/fast-rcnn-master/models/VGG16/test.prototxt', 1, weights='/home/rvlab/Documents/fast-rcnn-master/data/fast_rcnn_models/vgg16_fast_rcnn_iter_40000.caffemodel')
[libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.NetParameter: 392:21: Message type "caffe.LayerParameter" has no field named "roi_pooling_param".
F0718 17:01:06.815565 26562 upgrade_proto.cpp:79] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: /home/rvlab/Documents/fast-rcnn-master/models/VGG16/test.prototxt
*** Check failure stack trace: ***
Aborted (core dumped

@ice-pice

@CrossLee1 With resnet50 + py-faster-rcnn, I am able to achieve 45% mAP.
My guess is that if I use resnet101, I can reach closer to the 48% baseline.

@CrossLee1

@ice-pice
Glad to hear that~
Could you provide your train_val.prototxt for reference, or your email so we can discuss it in detail~
Thanks a lot~

@CrossLee1

@sarkeribrahim
Did you implement this step?

```
# Build the Cython modules
cd $FRCN_ROOT/lib
make
```

@ice-pice

@CrossLee1 My resnet-50 + faster-rcnn prototxt.

@yjn870

yjn870 commented Jul 26, 2016

@ice-pice
Can I get the test.prototxt file? The test.prototxt (from the resnet generator) is not compatible.

@zimenglan-sysu-512

@happyharrycn can you share your train & test prototxt for "resnet + faster rcnn"?
Thanks.

@banxiaduhuo

@ice-pice The ResNet50 and ResNet101 models I trained both come in close to 44.4 mAP. How about yours?
I wonder if you know how to implement the methods from He's paper, such as box refinement, context learning, and multi-scale testing. My implementation doesn't seem good.
Thank you so much!

@ice-pice

@banxiaduhuo I did implement the box refinement strategy, and it gave me a 1.3% boost compared to the 2% mentioned in the paper. I did not get a chance to try the other two strategies.
You should read the details in the paper very carefully; I practically followed everything line by line.
Hope it helps!

@ice-pice

@yjn870: You need to remove the topmost data layer, as it is not required at test time.
Take some hints from https://github.com/rbgirshick/py-faster-rcnn/blob/master/models/coco/VGG16/faster_rcnn_end2end/test.prototxt

Remove all the layers that are unnecessary at test time.
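
For instance, the training data layer gets replaced by plain inputs at test time, following the layout of the linked VGG16 test.prototxt (the shapes are placeholders; py-faster-rcnn reshapes them at run time):

```
name: "ResNet-50"
input: "data"
input_shape {
  dim: 1
  dim: 3
  dim: 224
  dim: 224
}
input: "im_info"
input_shape {
  dim: 1
  dim: 3
}
```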

@rajiv235

Hello. I am trying to run ResNet-101 with Faster RCNN on an AWS 4 GB K520 GPU. I realized that this much GPU memory won't be enough, and I got the same error.
I0321 07:29:44.037149 1892 solver.cpp:60] Solver scaffolding done.
Loading pretrained model weights from data/imagenet_models/resnet.caffemodel
I0321 07:29:44.240974 1892 net.cpp:816] Ignoring source layer fc1000
I0321 07:29:44.241065 1892 net.cpp:816] Ignoring source layer prob
Solving...
F0321 07:29:45.412804 1892 syncedmem.cpp:56] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
Aborted (core dumped)

I wanted to ask: should an AWS g2.8xlarge instance (4 GPUs with 4 GB each) do the job, and has anyone tried that?

Thanks. Any help will be appreciated.

@EthanReid

Hi everyone. I am having trouble with the Makefile and with finding ./tools/demo.py. The Makefile says no targets found. I installed it in cuda/fast-rcnn/lib and cuda. When I try to run ./tools/demo.py it says no directory found.

@zimenglan-sysu-512

@ice-pice @banxiaduhuo can you share how you do bbox refinement?

@zimenglan-sysu-512

zimenglan-sysu-512 commented Aug 30, 2016

@happyharrycn @banxiaduhuo I tried resnet50 with 80k iterations and got 73.59% mAP on PASCAL VOC 07.

@spandanagella

Did anyone try ResNet-50 or a deeper variant on the COCO classes using py-faster-rcnn?

@KeyKy

KeyKy commented Nov 24, 2016

I tried the ResNet-50 prototxt from ice-pice and the ResNet-50 prototxt from siddharthm83, with base_lr=0.001 (step=300000) and total_iters=490000. However, I only get 0.265 mAP (IoU=0.5) on COCO. Does anyone have a ResNet-50 + py-faster-rcnn model pretrained on COCO?

@ice-pice @banxiaduhuo I cannot reproduce your results. Could you help me?

@Eniac-Xie

@spandanagella I released an implementation (prototxt file and model weights) of ResNet-101-based faster-rcnn; check this repo.

@spandanagella

@Eniac-Xie Thanks. I am looking for a ResNet model trained on the COCO object categories. It looks like you have a resnet-based faster-rcnn for PASCAL-VOC only.

@onkarganjewar

onkarganjewar commented Feb 23, 2017

@KeyKy @ice-pice @banxiaduhuo @Eniac-Xie @zimenglan-sysu-512

I'm trying to train the ResNet-50 model on the PASCAL VOC 2007 trainval dataset. I've followed the comments in issue #62, so I'm using this command to start the training:

./tools/train_net.py --gpu 1 --weights data/imagenet_models/ResNet-50-model.caffemodel --imdb voc_2007_trainval --cfg experiments/cfgs/faster_rcnn_end2end.yml --solver models/ResNet-50/faster_rcnn_end2end/solver.prototxt

I'm using the solver/train prototxt files from @twtygqyy's repo.

However, I'm getting this error:

Normalizing targets done
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0222 13:59:58.538053 23076 solver.cpp:54] Initializing solver from parameters:
test_iter: 100
test_interval: 1000
base_lr: 0.0001
display: 100
max_iter: 200000
lr_policy: "multistep"
gamma: 0.1
momentum: 0.9
weight_decay: 0.0001
stepsize: 20000
snapshot: 10000
snapshot_prefix: "resnet50_train"
solver_mode: GPU
net: "models/ResNet-50/faster_rcnn_end2end/ResNet-50-train_val.prototxt"
test_initialization: false
I0222 13:59:58.538121 23076 solver.cpp:96] Creating training net from net file: models/ResNet-50/faster_rcnn_end2end/ResNet-50-train_val.prototxt
[libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.NetParameter: 74:26: Message type "caffe.LayerParameter" has no field named "batch_norm_param".
F0222 13:59:58.538242 23076 upgrade_proto.cpp:928] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: models/ResNet-50/faster_rcnn_end2end/ResNet-50-train_val.prototxt
*** Check failure stack trace: ***
Aborted (core dumped)

I'm on the latest commit of the faster-rcnn branch of caffe-fast-rcnn.

Pardon my lack of knowledge, but would you guys mind helping me resolve this error, please? I appreciate it. Thanks.

@joyivan

joyivan commented Mar 2, 2017

make it

@twtygqyy

twtygqyy commented Mar 2, 2017

Here is caffe-fast-rcnn merged with upstream caffe: https://github.com/twtygqyy/caffe-fast-rcnn-upstream

@646677064

646677064 commented May 9, 2017

@onkarganjewar @twtygqyy @Eniac-Xie @happyharrycn I tried R-FCN + ResNet-101 (from Jifeng Dai, via https://github.com/Orpine/py-R-FCN). Why does R-FCN-ohem take 5 GB of GPU memory while faster-rcnn+resnet-101-bn-scale-merged-ohem (from @Eniac-Xie) takes 11 GB? I don't know what makes the difference. Can somebody help, please?

@murphypei

@646677064 Because of the fully connected layers; they hold most of the parameters.

@Dareschoels

I'm using MATLAB 2017a (student version), GPU: GTX 1060 (6 GB).
I have a few questions related to MATLAB; I hope I will get the answers I need, thanks.
1. Are there any special requirements if I want to make my own set of images to retrain ResNet or AlexNet, and which one is better?
2. When I retrain a network for my purpose, what would be the optimal number of epochs (I guess that depends on my dataset)?
3. How much GPU memory does the Faster RCNN object detector require, and do I have to label images manually or is there a faster way?
4. Any tips on what tool to use for implementing a Faster RCNN detector on video for real-time detection?

@whmin

whmin commented Nov 9, 2017

@646677064 I have tried faster-rcnn+resnet-50-bn-scale-merged-ohem (from @Eniac-Xie), but there is an error when I run "./experiments/scripts/faster_rcnn_end2end.sh 0 ResNet-50 pascal_voc", like this:
[screenshot from 2017-11-08 21-08-43]

Do you know how to solve it? Can you share the test.prototxt file of your faster-rcnn+resnet-101-bn-scale-merged-ohem (from @Eniac-Xie) with me? Thank you very much!!!

@nnop

nnop commented Dec 1, 2017

Have you found the reason for the slow-training problem? I am hitting the same issue: about 4s/iter on a Titan X. @CrossLee1

@YoungMagic

Same thing happened to me! Any idea yet? @nnop

@mantou22

Excuse me. When I train my own model and run demo.py with it to detect images, the results come out all white (including the image) when the input image is very large (5000×3000 pixels). If the image is not too large, there is no problem. What could be the reason?
