
Py-Faster-Rcnn using Resnet #122

Open · hoticevijay opened this issue Mar 21, 2016 · 43 comments

@hoticevijay

Based on #62
I am trying to train on my own dataset using ResNet + py-faster-rcnn (using @siddharthm83's train.txt). I am getting the following error.

I0321 07:29:44.037149 1892 solver.cpp:60] Solver scaffolding done.
Loading pretrained model weights from data/imagenet_models/resnet.caffemodel
I0321 07:29:44.240974 1892 net.cpp:816] Ignoring source layer fc1000
I0321 07:29:44.241065 1892 net.cpp:816] Ignoring source layer prob
Solving...
F0321 07:29:45.412804 1892 syncedmem.cpp:56] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
Aborted (core dumped)

I am using an AWS instance. I was able to train ResNet-50 (without fast-rcnn) on the same instance with the same dataset, but when I try py-faster-rcnn I get this error. I know this error could be due to insufficient memory, so I changed the batch size in deploy.prototxt (iter_size: 1), but I still get the error. Can someone help me out?

@abhirevan

Experiment with different batch sizes in the yml file or in config.py.
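
For example, here is a minimal sketch of those knobs; the keys are the ones defined in py-faster-rcnn's lib/fast_rcnn/config.py, and the values are only illustrative trade-offs of accuracy for memory:

```python
# Override memory-related settings before training starts;
# the same keys can also be set in the experiment .yml file.
from fast_rcnn.config import cfg

cfg.TRAIN.IMS_PER_BATCH = 1   # images per minibatch
cfg.TRAIN.BATCH_SIZE = 64     # ROIs sampled per image (default 128)
cfg.TRAIN.SCALES = (400,)     # target shorter side of the input (default 600)
cfg.TRAIN.MAX_SIZE = 666      # cap on the longer side (default 1000)
```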

@happyharrycn

ResNet-101 on Caffe requires >10 GB of GPU memory for an input at VGA resolution (640*480) during training (even when you fix all conv1_x/conv2_x/conv3_x layers). A couple of memory optimizations can be done easily, though:

  • Always compile Caffe with cuDNN to avoid internal buffers for conv layers.
  • Merge BatchNorm + Scale into a single Scale layer; the current implementation of BatchNorm in Caffe takes far too much memory (a folding sketch follows this list). Or, more radically, you can fold conv + BatchNorm + Scale into a single conv layer, at the cost of some detection performance (since you can then no longer freeze the BatchNorm+Scale layers).
  • Use the in-place eltwise sum from the PR here.
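
A minimal sketch of the BatchNorm + Scale merge from the second bullet, assuming Caffe's blob layout for BatchNorm ([mean_sum, variance_sum, moving_average_factor]) and Scale ([gamma, beta]); the layer names and file paths are placeholders. After folding, delete the BatchNorm layers from the prototxt and keep only the (now frozen) Scale layers:

```python
import numpy as np
import caffe

def fold_bn_into_scale(net, bn_name, scale_name, eps=1e-5):
    factor = net.params[bn_name][2].data[0]
    factor = 0.0 if factor == 0 else 1.0 / factor
    mean = net.params[bn_name][0].data * factor
    var = net.params[bn_name][1].data * factor
    gamma = net.params[scale_name][0].data.copy()
    beta = net.params[scale_name][1].data.copy()
    std = np.sqrt(var + eps)
    # gamma * (x - mean) / std + beta == (gamma/std) * x + (beta - gamma*mean/std)
    net.params[scale_name][0].data[...] = gamma / std
    net.params[scale_name][1].data[...] = beta - gamma * mean / std

net = caffe.Net('ResNet-50-train_val.prototxt',
                'data/imagenet_models/ResNet-50-model.caffemodel', caffe.TEST)
fold_bn_into_scale(net, 'bn2a_branch1', 'scale2a_branch1')  # repeat per BN/Scale pair
net.save('ResNet-50-bn-folded.caffemodel')
```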

A more fundamental solution is to allow Caffe to reuse the gradient (diff) buffers across blobs. One can safely overwrite the diff of a blob once the weights of all layers consuming that blob have been updated. That is how ResNet-101 can be trained with a batch size of 32 on 12 GB of memory, as mentioned in the original paper.
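
A toy illustration of that scheduling idea (plain Python, not Caffe code): walk the layers backward, and once the consumer of a top blob's diff has run, return the buffer to a pool so a lower blob can recycle it.

```python
import numpy as np

class Blob(object):
    def __init__(self, shape):
        self.shape = shape
        self.diff = None

blobs = [Blob((4,)), Blob((4,)), Blob((4,))]  # toy chain: blob0 -> blob1 -> blob2
free_pool = []                                # retired diff buffers

def get_buffer(shape):
    """Recycle an exact-size buffer from the pool, else allocate a new one."""
    for i, buf in enumerate(free_pool):
        if buf.shape == shape:
            return free_pool.pop(i)
    return np.zeros(shape)

blobs[-1].diff = np.ones(blobs[-1].shape)     # seed the topmost gradient
for i in reversed(range(len(blobs) - 1)):
    bottom, top = blobs[i], blobs[i + 1]
    bottom.diff = get_buffer(bottom.shape)
    bottom.diff[...] = top.diff               # stand-in for layer.backward()
    # ...this layer's weight update would happen here; top.diff is then dead:
    free_pool.append(top.diff)
    top.diff = None
```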

@kl2005ad

@happyharrycn I found something interesting when training ResNet + faster rcnn on my own dataset. If I fix all batchNorm+scale layers in conv1 ~ conv4 and only allow updates on the conv layers, the resulting model is far from what the paper claims. If I allow batchNorm+scale to update, I get much better performance (close to VGG16). But faster rcnn uses only one image per batch and is not supposed to be able to update batchNorm+scale properly. What am I missing?
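
For context, "fixing" a batchNorm+scale pair is usually expressed like this in the train prototxt (a sketch only; layer and blob names follow the standard ResNet naming and may differ in your model). use_global_stats makes BatchNorm apply the stored ImageNet statistics, and the zero lr_mult/decay_mult stop all updates:

```
layer {
  name: "bn2a_branch1"
  type: "BatchNorm"
  bottom: "res2a_branch1"
  top: "res2a_branch1"
  batch_norm_param { use_global_stats: true }
  param { lr_mult: 0 }
  param { lr_mult: 0 }
  param { lr_mult: 0 }
}
layer {
  name: "scale2a_branch1"
  type: "Scale"
  bottom: "res2a_branch1"
  top: "res2a_branch1"
  scale_param { bias_term: true }
  param { lr_mult: 0 decay_mult: 0 }
  param { lr_mult: 0 decay_mult: 0 }
}
```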

@happyharrycn

@kl2005ad ResNet for detection has been reproduced at multiple sites, and I am not sure why you are getting worse performance based on your description here. The last time I tried on VOC, it worked better than claimed in the paper :) Here are some implementation details I used for training.

  1. Freeze all batchNorm+Scale layers (from conv1_x ~ conv5_x).
  2. Optional: freeze conv1_x to conv3_x to save some memory/time.
  3. Put a ROI pooling layer at the end of conv4_x, use conv5_x as the classifier (similar to the FC layers in VGG16), and attach two prediction branches (softmax with loss for classification and smooth L1 loss for box regression) after the average pooling.
  4. You might want to change the ROI pooling from 14 * 14 to 7 * 7 and increase the resolution in conv5_x (change the downsampling layers by setting their stride to 1). This is not exactly equivalent to the original paper, but it helps detect smaller objects (a prototxt sketch of this pooling setup follows).
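
A prototxt sketch of the pooling setup from points 3 and 4 (the bottom blob names are assumptions based on the standard ResNet-50 naming):

```
layer {
  name: "roi_pool"
  type: "ROIPooling"
  bottom: "res4f"          # last conv4_x output in ResNet-50 naming
  bottom: "rois"
  top: "roi_pool"
  roi_pooling_param {
    pooled_w: 7            # the 7 * 7 variant from point 4
    pooled_h: 7
    spatial_scale: 0.0625  # 1/16, the feature stride at conv4_x
  }
}
```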

@banxiaduhuo

@kl2005ad @happyharrycn Hi! I tried to train resnet50 with faster-rcnn and got a very low result on VOC2007, about 0.47, even lower than the ZF model's 0.62. What result did you get on VOC2007? I didn't freeze the batchNorm+Scale layers.

Is my solver correct?
base_lr: 0.001
lr_policy: "multistep"
gamma: 0.1
stepvalue: 300000
stepvalue: 500000
display: 20
momentum: 0.9
weight_decay: 0.0001
snapshot: 0

Thank you very much!

@happyharrycn

I was getting ~0.73-0.74 mAP on VOC07 test using ResNet-101 (trained on VOC07 trainval) with 60K iterations. Training details can be found in my previous post in this thread. From a quick look at your solver file, I think you probably have too many iterations (500K is way too much).

@banxiaduhuo

@happyharrycn Thank you~ I was getting 0.65 mAP on VOC2007 using ResNet-50 with 70k iterations, following your implementation details (fixing all batchNorm+scale layers and conv1_x ~ conv3_x). I will try ResNet-101. It seems BN must be fixed when fine-tuning, right?
And how do I merge BatchNorm + Scale and let Caffe reuse the gradients for each blob? Can you give me some guidance?

@c149028

c149028 commented Apr 11, 2016

@banxiaduhuo, one option: a BatchNorm layer acts as a Scale layer at test time, and two consecutive Scale layers can be merged into one (y = s1*x + b1 followed by z = s2*y + b2 collapses to z = (s2*s1)*x + (s2*b1 + b2)).
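
A tiny numpy check of that identity (scalar scales for brevity; per-channel vectors behave the same way):

```python
import numpy as np

x = np.random.randn(8)
s1, b1 = 1.5, 0.2    # first Scale layer
s2, b2 = 0.7, -0.1   # second Scale layer
z_two = s2 * (s1 * x + b1) + b2            # two consecutive Scale layers
z_one = (s2 * s1) * x + (s2 * b1 + b2)     # single merged Scale layer
assert np.allclose(z_two, z_one)
```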
@happyharrycn, thank you for sharing your knowledge, it's really helpful! I have a question about reusing the diff blobs. Does that mean we need to modify Caffe so that backprop (gradient computation and parameter update) proceeds layer by layer?

@ice-pice

ice-pice commented Jun 19, 2016

@happyharrycn I am trying to fine-tune resnet-50 + faster-rcnn on the COCO dataset as in Kaiming's paper, using a learning rate of 0.001 for 240K iterations and 0.0001 for the next 80K iterations (with the provided end2end training).
It appears that this number of iterations is too high, because the val AP score starts decreasing from iteration 150K onwards. Can you share some insight into how many iterations are required to train on the 80K images of the COCO dataset?
Thanks!

@happyharrycn

happyharrycn commented Jun 20, 2016

@ice-pice I think 240K + 80K is actually not enough iterations for training on COCO. I have used 500K iterations for 120K images (COCO train + val). Have you tried keeping the training running for the full 320K iterations to check whether the AP keeps decreasing after 150K?

@ice-pice

ice-pice commented Jun 27, 2016

  1. @happyharrycn I was skipping the average pooling from ResNet, which was leading to the poor results after 150K iterations mentioned above. I found my mistake when I generated the model using https://github.com/XiaozhiChen/resnet-generator. I'd presume that without average pooling there are too many parameters between the last convolutional layer and the first fully connected layer, which are difficult to calibrate by fine-tuning (see the quick arithmetic at the end of this comment).
  2. Below are my network configuration and results. Can you point out the reasons why my AP scores are below the reported ones?
  • Model : ResNet-50 + faster-rcnn

  • Train/Test set : COCO train/validation set

  • Iterations : 320K (240K with lr = 0.001 and 80K with lr = 0.0001)

  • Mini-batch : 1 image generating 256 proposals (as mentioned in the faster-rcnn paper)

  • Detection network initialization : ResNet-50 Imagenet trained weights

  • RPN network initialization : Random initialization

  • Results : AP (IoU=0.5) scores at different iterations for val set

    [image attachment: AP scores at different iterations]

Note that my 320K iterations process 320K images. In the 500K iterations you mentioned, do you process 500K images or 500*8K?

Thanks!
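
Quick arithmetic behind point 1 above, assuming a 7x7x2048 conv5 output per ROI and COCO's 81 classes (80 + background); the exact sizes depend on the head design:

```python
print(7 * 7 * 2048 * 81)  # 8,128,512 weights for a direct FC layer
print(2048 * 81)          # 165,888 weights after global average pooling
```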

@CrossLee1

@ice-pice
I also tried to train resnet50+faster rcnn on the COCO dataset.
However, the training speed is very slow, about 4s/iter, and the loss does not seem to decrease at all.

What is your training speed? Could you share your log file so I can see how the loss changes?
Thanks~

@ice-pice

ice-pice commented Jun 30, 2016

@CrossLee1
It takes around 1s/iter for me to train resnet50 + faster-rcnn on an NVIDIA Titan X. 4s/iter seems like too much; it could be happening because you are adding unnecessary additional parameters to the architecture. You can validate your prototxt by comparing it with one generated by https://github.com/XiaozhiChen/resnet-generator.

From my observation, the change in loss is not indicative of convergence in this case because the mini-batch size is 1. After 50K iterations or so, the loss fluctuates within the same interval until 320K; smoothing the logged loss gives a clearer picture (see the sketch below).
Link to the log for the model I trained in my previous comment: https://drive.google.com/file/d/0B4AOlDvVIP8RMUxQS0M5dWJEYjg/view?usp=sharing
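
A minimal sketch for judging convergence from a noisy log like this one: pull the per-iteration loss out of the Caffe log and smooth it with a moving average (the log path, regex, and window size are placeholders to adapt):

```python
import re
import numpy as np

losses = []
with open('train.log') as f:
    for line in f:
        # matches solver lines like "Iteration 100, loss = 1.234"
        m = re.search(r'Iteration \d+, loss = ([\d.]+)', line)
        if m:
            losses.append(float(m.group(1)))

window = 500
smoothed = np.convolve(losses, np.ones(window) / window, mode='valid')
print(smoothed[::1000])  # coarse view of the smoothed trend
```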

I am still making changes, and if I reach the baseline I can share the prototxt with you if you'd like.
Cheers!

@CrossLee1

@ice-pice
Thanks for your reply~
I will use a prototxt generated by https://github.com/XiaozhiChen/resnet-generator and try again.

Wishing you good results~

@CrossLee1

@ice-pice
I wonder whether you have succeeded in training resnet50 + py-faster-rcnn and reached the baseline?
Hope to get your results~

@cervantes-loves-ai

How do I solve this?
WARNING: Logging before InitGoogleLogging() is written to STDERR
W0718 17:01:06.813407 26562 _caffe.cpp:122] DEPRECATION WARNING - deprecated use of Python interface
W0718 17:01:06.813459 26562 _caffe.cpp:123] Use this instead (with the named "weights" parameter):
W0718 17:01:06.813477 26562 _caffe.cpp:125] Net('/home/rvlab/Documents/fast-rcnn-master/models/VGG16/test.prototxt', 1, weights='/home/rvlab/Documents/fast-rcnn-master/data/fast_rcnn_models/vgg16_fast_rcnn_iter_40000.caffemodel')
[libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.NetParameter: 392:21: Message type "caffe.LayerParameter" has no field named "roi_pooling_param".
F0718 17:01:06.815565 26562 upgrade_proto.cpp:79] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: /home/rvlab/Documents/fast-rcnn-master/models/VGG16/test.prototxt
*** Check failure stack trace: ***
Aborted (core dumped

@ice-pice

@CrossLee1 With resnet50 + py-faster-rcnn, I am able to achieve 45% mAP.
My guess is that if I use resnet101, I can reach closer to the 48% baseline.

@CrossLee1

@ice-pice
Glad to hear that~
Could you provide your train_val.prototxt for reference, or your email so we can discuss it in detail~
Thanks a lot~

@CrossLee1

@sarkeribrahim
Did you implement this step?

```
# Build the Cython modules
cd $FRCN_ROOT/lib
make
```

@ice-pice

@CrossLee1 My resnet-50 + faster-rcnn prototxt.

@yjn870

yjn870 commented Jul 26, 2016

@ice-pice
Can I get the test.prototxt file? The test.prototxt (from the resnet generator) is not compatible.

@zimenglan-sysu-512

@happyharrycn can you share your train & test prototxt for "resnet + faster rcnn"?
Thanks.

@banxiaduhuo

@ice-pice The ResNet50 and ResNet101 models I trained both come in close to 44.4 mAP. How about yours?
I wonder if you know how to implement the methods from He's paper, such as box refinement, context learning, and multi-scale testing. My implementation doesn't seem good.
Thank you so much!

@ice-pice

@banxiaduhuo I did implement the box refinement strategy, and it gave me a 1.3% boost compared to the 2% mentioned in the paper. I did not get a chance to try the other two strategies.
You should read the details in the paper very carefully; I practically followed everything line by line.
Hope it helps!

@ice-pice

@yjn870: You need to remove the topmost data layer, as it is not required at test time.
Take some hints from https://github.com/rbgirshick/py-faster-rcnn/blob/master/models/coco/VGG16/faster_rcnn_end2end/test.prototxt

Remove all the layers that are unnecessary at test time.
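
For instance, the training data layer gets replaced by plain inputs at test time, following the layout of the linked VGG16 test.prototxt (the shapes are placeholders; py-faster-rcnn reshapes them at run time):

```
name: "ResNet-50"
input: "data"
input_shape {
  dim: 1
  dim: 3
  dim: 224
  dim: 224
}
input: "im_info"
input_shape {
  dim: 1
  dim: 3
}
```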

@rajiv235

Hello. I am trying to run ResNet-101 with Faster RCNN on an AWS 4 GB K520 GPU. I realized that this much GPU memory won't be enough, and I got the same error.
I0321 07:29:44.037149 1892 solver.cpp:60] Solver scaffolding done.
Loading pretrained model weights from data/imagenet_models/resnet.caffemodel
I0321 07:29:44.240974 1892 net.cpp:816] Ignoring source layer fc1000
I0321 07:29:44.241065 1892 net.cpp:816] Ignoring source layer prob
Solving...
F0321 07:29:45.412804 1892 syncedmem.cpp:56] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
Aborted (core dumped)

I wanted to ask: should an AWS g2.8xlarge instance (4 GPUs with 4 GB each) do the job, and has anyone tried that?

Thanks. Any help will be appreciated.

@EthanReid

Hi everyone. I am having trouble with the Makefile and with finding ./tools/demo.py. The Makefile says no targets found. I installed it in cuda/fast-rcnn/lib and cuda. When I try to run ./tools/demo.py it says no directory found.

@zimenglan-sysu-512

@ice-pice @banxiaduhuo can you share how you do bbox refinement?

@zimenglan-sysu-512

zimenglan-sysu-512 commented Aug 30, 2016

@happyharrycn @banxiaduhuo I tried resnet50 with 80k iterations and got 73.59% mAP on PASCAL VOC 07.

@spandanagella

Did anyone try ResNet-50 or a deeper variant on the COCO classes using py-faster-rcnn?

@KeyKy

KeyKy commented Nov 24, 2016

I tried the ResNet-50 prototxt from ice-pice and the ResNet-50 prototxt from siddharthm83, with base_lr=0.001 (step=300000) and total_iters=490000. However, I only get 0.265 mAP (IoU=0.5) on COCO. Does anyone have a ResNet-50 + py-faster-rcnn model pretrained on COCO?

@ice-pice @banxiaduhuo I cannot reproduce your results. Could you help me?

@Eniac-Xie

@spandanagella I released an implementation (prototxt file and model weights) of ResNet-101-based faster-rcnn; check this repo.

@spandanagella

@Eniac-Xie Thanks. I am looking for a ResNet model trained on the COCO object categories. It looks like you have a resnet-based faster-rcnn for PASCAL-VOC only.

@onkarganjewar

onkarganjewar commented Feb 23, 2017

@KeyKy @ice-pice @banxiaduhuo @Eniac-Xie @zimenglan-sysu-512

I'm trying to train the ResNet-50 model on the PASCAL VOC 2007 trainval dataset. I've followed the comments in issue #62, so I'm using this command to start the training:

./tools/train_net.py --gpu 1 --weights data/imagenet_models/ResNet-50-model.caffemodel --imdb voc_2007_trainval --cfg experiments/cfgs/faster_rcnn_end2end.yml --solver models/ResNet-50/faster_rcnn_end2end/solver.prototxt

I'm using the solver/train prototxt files from @twtygqyy's repo.

However, I'm getting this error:

Normalizing targets done
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0222 13:59:58.538053 23076 solver.cpp:54] Initializing solver from parameters:
test_iter: 100
test_interval: 1000
base_lr: 0.0001
display: 100
max_iter: 200000
lr_policy: "multistep"
gamma: 0.1
momentum: 0.9
weight_decay: 0.0001
stepsize: 20000
snapshot: 10000
snapshot_prefix: "resnet50_train"
solver_mode: GPU
net: "models/ResNet-50/faster_rcnn_end2end/ResNet-50-train_val.prototxt"
test_initialization: false
I0222 13:59:58.538121 23076 solver.cpp:96] Creating training net from net file: models/ResNet-50/faster_rcnn_end2end/ResNet-50-train_val.prototxt
[libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.NetParameter: 74:26: Message type "caffe.LayerParameter" has no field named "batch_norm_param".
F0222 13:59:58.538242 23076 upgrade_proto.cpp:928] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: models/ResNet-50/faster_rcnn_end2end/ResNet-50-train_val.prototxt
*** Check failure stack trace: ***
Aborted (core dumped)

I'm on the latest commit of the faster-rcnn branch of caffe-fast-rcnn.

Pardon my lack of knowledge, but would you guys mind helping me resolve this error, please? I appreciate it. Thanks.

@joyivan

joyivan commented Mar 2, 2017

make it

@twtygqyy

twtygqyy commented Mar 2, 2017

Here is caffe-fast-rcnn merged with upstream caffe: https://github.com/twtygqyy/caffe-fast-rcnn-upstream

@646677064

646677064 commented May 9, 2017

@onkarganjewar @twtygqyy @Eniac-Xie @happyharrycn I tried R-FCN + ResNet-101 (from Jifeng Dai, via https://github.com/Orpine/py-R-FCN). Why does R-FCN-ohem take 5 GB of GPU memory while faster-rcnn+resnet-101-bn-scale-merged-ohem (from @Eniac-Xie) takes 11 GB? I don't know what makes the difference. Can somebody help, please?

@murphypei

@646677064 Because of the fully connected layers; they hold most of the parameters.

@Dareschoels

I'm using MATLAB 2017a (student version), GPU: GTX 1060 (6 GB).
I have a few questions related to MATLAB; I hope I will get the answers I need, thanks.
1. Are there any special requirements if I want to make my own set of images to retrain ResNet or AlexNet, and which one is better?
2. When I retrain a network for my purpose, what would be the optimal number of epochs (I guess that depends on my dataset)?
3. How much GPU memory does the Faster RCNN object detector require, and do I have to label images manually or is there a faster way?
4. Any tips on what tool to use for implementing a Faster RCNN detector on video for real-time detection?

@whmin

whmin commented Nov 9, 2017

@646677064 I have tried faster-rcnn+resnet-50-bn-scale-merged-ohem (from @Eniac-Xie), but there is an error when I run "./experiments/scripts/faster_rcnn_end2end.sh 0 ResNet-50 pascal_voc", like this:
[screenshot from 2017-11-08 21-08-43]

Do you know how to solve it? Can you share the test.prototxt file of your faster-rcnn+resnet-101-bn-scale-merged-ohem (from @Eniac-Xie) with me? Thank you very much!!!

@nnop

nnop commented Dec 1, 2017

Have you found the reason for the slow-training problem? I am hitting the same issue: about 4s/iter on a Titan X. @CrossLee1

@YoungMagic

Same thing happened to me! Any idea yet? @nnop

@mantou22

Excuse me. When I train my own model and run demo.py with it to detect images, the results come out all white (including the image) when the input image is very large (5000×3000 pixels). If the image is not too large, there is no problem. What could be the reason?
