
Implementation detail of training code #10

Closed
Xiangyu-CAS opened this issue Dec 5, 2016 · 38 comments

@Xiangyu-CAS

Hi, tianzhi,
I tried to implement the CTPN training code on the py-faster-rcnn framework (by RBG), but my results were different from yours (and of course worse).

  1. Loss function. Did you modify the loss functions (e.g. SmoothL1Loss) in the training code?
  2. Heights of vertical proposals in a text line. A complete text line consists of several vertical anchors in sequence. In your implementation their heights vary only slightly, but in my implementation they vary enormously. Sometimes a proposal fits tightly to the boundary of a single character, so if the heights of the characters within a text line differ greatly, the heights of the proposals differ too. So my question is: how did you make the heights and y coordinates of a proposal sequence uniform? Via the LSTM, or some other change in a Python layer? If the answer is the LSTM, does that mean the LSTM is not working in my implementation? FTPN (CTPN with no RNN) seems to have the same problem.
@tianzhi0549
Owner

tianzhi0549 commented Dec 5, 2016

@Xiangyu-CAS Thank you for your interest.

  1. We didn't modify the loss functions.
  2. Yes, the LSTM can use context information to align these text proposals. Maybe your network simply isn't trained sufficiently.

Thank you:-).

@Xiangyu-CAS
Author

Xiangyu-CAS commented Dec 7, 2016

Thank you! I figured it out! It's not a problem with the LSTM; the LSTM works well after 40k iterations on a training set of 20k images. It's the RCNN stage that is responsible. It seems that RCNN weakened the sequential information (the input of RCNN is conv5_3 instead of the LSTM output). After discarding RCNN, I got results that look reasonable. Compared with Faster R-CNN, CTPN discards the RCNN stage; did you ever test the performance of CTPN + RCNN?

BTW, with your help, I trained my own CTPN model! From the results, it looks better than Faster R-CNN, but still worse than yours. The evaluation result on ICDAR2013 is (R = 81.08%, P = 81.61%, F = 81.34%), compared with my own Faster R-CNN model, which ended up with (R = 75.21%, P = 85.85%, F = 80.18%).
Of course, some problems still exist. First, the bounding boxes are larger than the GT, especially for small text lines. Second, the bounding boxes tend to extend past the beginning and end of the text line most of the time. I think this might be a fault in the training set; maybe the GT boxes were not labeled carefully enough.

I have never seen anyone as patient as you in teaching a stranger! ^_^ Thank you again for your kindness.

@tianzhi0549
Owner

@Xiangyu-CAS Thank you. We didn't try adding Fast RCNN behind CTPN because of the time and memory costs. Also, I'm very glad that you got these good results.

@Jayhello

@Xiangyu-CAS
Can you give an example of the images you train on, and the box config file?
Thank you very much.

@Xiangyu-CAS
Author

Xiangyu-CAS commented Feb 9, 2017

@Jayhello The implementation is based on the original py-faster-rcnn by Ross Girshick, and the training set is ICDAR2013 plus 20,000 labeled pictures, transformed into VOC format. The only difference is that I divided each text-line bounding-box GT into fixed-width (16-pixel) segments, just like Figure 2 (right) in the CTPN paper.
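For reference, a minimal sketch of how that split could be done; the function name `split_gt_box` and the snapping of segment borders to the 16-px anchor grid are my own assumptions, not the actual training code:

```python
import numpy as np

def split_gt_box(box, stride=16):
    """Split one text-line GT box (x1, y1, x2, y2) into fixed-width
    vertical segments, as in Figure 2 (right) of the CTPN paper."""
    x1, y1, x2, y2 = box
    # Align the split points with the 16-px grid the anchors live on.
    start = int(np.floor(x1 / stride)) * stride
    end = int(np.ceil((x2 + 1) / stride)) * stride
    segments = []
    for left in range(start, end, stride):
        seg_x1 = max(x1, left)                 # clip first segment to the GT
        seg_x2 = min(x2, left + stride - 1)    # clip last segment to the GT
        segments.append((seg_x1, y1, seg_x2, y2))
    return segments
```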

@melody-rain

Hi @Xiangyu-CAS

Do you do the offset regression in your code?

@melody-rain

@Xiangyu-CAS
Also, in the ICDAR datasets the GT is labeled word by word, but with CTPN the words in a row are grouped together. How do you separate the words in a group?

@Xiangyu-CAS
Author

Xiangyu-CAS commented Mar 1, 2017

@melody-rain Offset regression? Do you mean side refinement? No, I did not. In CTPN, the LSTM tends to group words together, but you can overcome this by adding the spaces between words as negative GT.
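In case it helps, one possible reading of "space as negative GT", assuming word-level boxes as (x1, y1, x2, y2) tuples; `gap_boxes` is a hypothetical helper, not code from this repo:

```python
def gap_boxes(word_boxes):
    """Build the inter-word gap boxes on one text line so they can be
    fed to the sampler as explicit negatives."""
    word_boxes = sorted(word_boxes, key=lambda b: b[0])  # left to right
    gaps = []
    for a, b in zip(word_boxes, word_boxes[1:]):
        if b[0] > a[2] + 1:  # a visible gap between consecutive words
            gaps.append((a[2] + 1, min(a[1], b[1]),
                         b[0] - 1, max(a[3], b[3])))
    return gaps
```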

@leftstone2015

hi, @Xiangyu-CAS
Thank you for sharing your implementation in detail. I want to ask two questions:

  1. How many images are used for training in one batch? RPN uses 1 image per batch, am I right?
  2. Negative samples are randomly selected so that the sum of num_bg and num_fg equals 128.
    I think the strategy for selecting negative and positive samples in SSD may be better. Are you following the CTPN paper or something else?

Thank you in advance.

@Xiangyu-CAS
Author

@leftstone2015 The parameters, including images per batch and the bg/fg counts, are not the principal things that matter, I suppose. All of these parameters follow the default values in py-faster-rcnn.
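For reference, these are the relevant sampling defaults from py-faster-rcnn's lib/fast_rcnn/config.py as I remember them; values may differ slightly between forks, so treat this as a sketch:

```python
from easydict import EasyDict as edict

__C = edict(); __C.TRAIN = edict()
__C.TRAIN.IMS_PER_BATCH = 2       # end-to-end RPN configs override this to 1
__C.TRAIN.BATCH_SIZE = 128        # ROIs per image for the Fast R-CNN head
__C.TRAIN.FG_FRACTION = 0.25      # at most 25% of those ROIs are foreground
__C.TRAIN.RPN_BATCHSIZE = 256     # anchors sampled per image for the RPN loss
__C.TRAIN.RPN_FG_FRACTION = 0.5   # at most half of the sampled anchors positive
```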

@thisismohitgupta

thisismohitgupta commented Mar 17, 2017

@leftstone2015 I too am unable to get this to converge. If I train only the class scores, they tend to converge, but when the coordinate regression is added, the network just won't converge at any learning rate.

@LearnerInGithub

@Xiangyu-CAS Hi, Xiangyu! I have a question: I added a regression layer for the vertical coordinates (cter_y, height), but the deploy file outputs normal 4-element bbox targets, so I am confused about how to handle this inconsistency... Thank you in advance!

@Xiangyu-CAS
Author

@LearnerInGithub The regression layer applies the offsets (cter_y, height) to the generated anchors (x, y, h, w) instead of outputting bbox coordinates directly. Even if you don't use the regression layer, you can use the anchors directly as bbox targets.
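To make this concrete, a minimal decoding sketch following the relative vertical targets defined in the CTPN paper (v_c = (c_y − c_y^a)/h^a, v_h = log(h/h^a)); the function name and array layout are my assumptions:

```python
import numpy as np

def apply_vertical_deltas(anchors, deltas):
    """anchors: (N, 4) boxes (x1, y1, x2, y2); deltas: (N, 2) as (v_c, v_h).
    Only the y-center and height are regressed; anchor width stays fixed."""
    heights = anchors[:, 3] - anchors[:, 1] + 1.0
    ctr_y = anchors[:, 1] + 0.5 * (heights - 1.0)

    pred_ctr_y = deltas[:, 0] * heights + ctr_y   # c_y = v_c * h_a + c_y^a
    pred_h = np.exp(deltas[:, 1]) * heights       # h   = exp(v_h) * h_a

    boxes = anchors.astype(np.float64)
    boxes[:, 1] = pred_ctr_y - 0.5 * (pred_h - 1.0)
    boxes[:, 3] = pred_ctr_y + 0.5 * (pred_h - 1.0)
    return boxes
```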

@LearnerInGithub

@Xiangyu-CAS Thanks for your explanation. I looked at the network architecture file again, and yes, this is not the problem; I think I need to change something else. Much appreciated!

@willyangyang

@Xiangyu-CAS Did you change the rpn-data layers? Because of the difference in anchor generation and the new output size, I changed the rpn-data layers from py-faster-rcnn. Unfortunately, I find it hard to converge. I'd appreciate your reply!

@LearnerInGithub

@Xiangyu-CAS Another question: how can we test our implemented CTPN on the test set of a dataset like ICDAR2013 or ICDAR2015? The test.py in the faster_rcnn module seems designed to test a trained fast_rcnn model, so how should I modify faster_rcnn_test.pt to make testing run normally?

@Xiangyu-CAS
Author

@willyangyang rpn-data layer? Perhaps you mean the roi-data layer? I don't know what's wrong with your code, but I did change the roi-data layer. Two kinds of modifications are applied: (1) divide the GT into vertical GT segments; (2) revise the anchors into vertical anchors. I guess you did not divide the GT into fixed-width vertical GT.
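For modification (2), a minimal sketch of fixed-width vertical anchors, following the CTPN paper's k = 10 anchors whose heights grow from 11 to roughly 273 px (divided by 0.7 each step); the function name and rounding are my assumptions:

```python
import numpy as np

def generate_vertical_anchors(base_size=16, k=10, min_height=11.0):
    """k anchors per location, all base_size px wide, heights growing
    geometrically; returned as (x1, y1, x2, y2) centered on a
    base_size x base_size cell."""
    heights, h = [], min_height
    for _ in range(k):
        heights.append(round(h))
        h /= 0.7                         # 11, 16, 23, ..., ~273
    ctr = (base_size - 1) / 2.0
    return np.array([[0, ctr - (hh - 1) / 2.0,
                      base_size - 1, ctr + (hh - 1) / 2.0]
                     for hh in heights])
```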

@Xiangyu-CAS
Author

Xiangyu-CAS commented Mar 28, 2017

@LearnerInGithub That's easy. In test.py, faster-rcnn obtains the ROI scores from the blob "cls_prob" and applies BBOX_REG to the boxes. You can get the RPN output directly from the blobs "rois" and "roi_scores", without BBOX_REG. CTPN is essentially a novel RPN, so you can get the CTPN result this way. Additionally, the result you get is a sequence of vertical proposals without connection; you need to connect them into complete text proposals. The connection function is provided in Tianzhi's test code.
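A minimal sketch of that change, assuming py-faster-rcnn's standard blob names and that the proposal_layer has been edited to expose 'roi_scores' (see the following comments); `ctpn_detect` is a hypothetical helper, not the actual test.py:

```python
def ctpn_detect(net, blobs):
    """Run a forward pass and return the raw RPN/CTPN proposals,
    skipping the Fast R-CNN head and BBOX_REG entirely."""
    blobs_out = net.forward(**blobs)
    rois = net.blobs['rois'].data.copy()   # (N, 5): batch_idx, x1, y1, x2, y2
    scores = blobs_out['roi_scores']       # objectness of each vertical proposal
    proposals = rois[:, 1:5]               # drop the batch index
    # These are unconnected vertical proposals; they still need to be
    # linked into full text lines by the connection step.
    return proposals, scores
```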

@LearnerInGithub

@Xiangyu-CAS I can't access the blob named 'roi_scores' in test.py; the program stops at the line `scores = blobs_out['roi_scores']` and gives me:
KeyError: 'roi_scores'

@LearnerInGithub

@Xiangyu-CAS I also can't find the blob named 'cls_prob'; `scores = blobs_out['cls_prob']` raises:
KeyError: 'cls_prob'

@LearnerInGithub

@Xiangyu-CAS @tianzhi0549 I want to ask how to test the pre-trained model on ICDAR2013. CTPN seems to always output one bbox per line of text, but ICDAR2013 marks only the word regions. Could we change the detection level to words instead of text lines?

@Xiangyu-CAS
Author

@LearnerInGithub It is an innate defect of CTPN, I suppose. The RNN tends to connect words together. You can add spaces as negative samples to improve it, as tianzhi said.

@Xiangyu-CAS
Author

@LearnerInGithub On the blobs that can't be found: refer to the model file to find those blobs. 'roi_scores' is in the proposal_layer and is commented out by default.

@willyangyang

@Xiangyu-CAS Have you ever changed SmoothL1Loss or SoftmaxWithLoss? I find it hard to converge.

@LearnerInGithub

@Xiangyu-CAS Thank you very much for your suggestions and answers!

@Xiangyu-CAS
Author

@willyangyang You don't need to change any loss function. SmoothL1Loss and SoftmaxWithLoss are fundamental loss functions. Some works change SmoothL1Loss to SmoothL2Loss, or the so-called 'SmoothLn', to converge in a better way. However, when you change your learning goal, you don't need to change these functions. For example, you don't need to change the SGD algorithm when you train a recognition task or a segmentation task.
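For reference, the elementwise SmoothL1 term as implemented in py-faster-rcnn's SmoothL1Loss layer, transcribed to NumPy for illustration:

```python
import numpy as np

def smooth_l1(x, sigma=1.0):
    """0.5 * (sigma * x)^2    if |x| < 1 / sigma^2
       |x| - 0.5 / sigma^2    otherwise"""
    s2 = sigma ** 2
    return np.where(np.abs(x) < 1.0 / s2,
                    0.5 * s2 * x ** 2,
                    np.abs(x) - 0.5 / s2)
```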

@willyangyang

@Xiangyu-CAS My rpn-score loss fluctuates around 0.3 after 30k iterations. I chose COCO-Text as my training data; is there a better choice?

@thisismohitgupta

@willyangyang Indeed, my score loss also hovers around 0.3, and the SmoothL1 loss around ~0.05, on MS-COCO. I even tried to overfit a very small dataset just to test it. @Xiangyu-CAS any help or suggestions?

@Xiangyu-CAS
Author

Xiangyu-CAS commented Mar 30, 2017

@willyangyang @thisismohitgupta We trained our model on a private dataset consisting of focused scene-text images similar to ICDAR2013. I suggest you train your model on the ICDAR2017 MLT dataset; MLT contains focused Latin-script languages including Italian, French, and German, and we provide 1,000 images for each of them. Some of them have already been published, and the rest will come soon. You can find it on the competition website.

@thisismohitgupta

@willyangyang I just got it working last night. I changed the weight-initialization distribution originally used in the paper, used a higher learning rate of 0.1, and added per-channel mean subtraction. I didn't test it on the whole dataset, just a small subset, so there's still work to be done.

@willyangyang

@thisismohitgupta How many iterations did you train before it converged?

@thisismohitgupta

@Xiangyu-CAS About the GT calculation:
I start by defining a numpy array of zeros with shape (number of proposals,), so every score is 0 for now.

Then I iterate over the GT annotations, which I've already sliced into 16-px segments, and calculate the IoU of each sliced GT and proposal box.

If it is greater than 0.7, I mark the score as 1 for that particular proposal.

The loop goes on, and the proposals that never exceed 0.7 stay at zero.
Am I doing it right?
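Roughly like this minimal sketch (the names and the IoU helper are illustrative, not my actual code):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]) + 1)
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]) + 1)
    inter = iw * ih
    area = lambda r: (r[2] - r[0] + 1) * (r[3] - r[1] + 1)
    return inter / (area(a) + area(b) - inter)

def label_proposals(proposals, gt_slices, pos_thresh=0.7):
    """Score 1 for any proposal overlapping some 16-px GT slice with
    IoU > pos_thresh, else 0. (py-faster-rcnn additionally keeps the
    highest-IoU anchor per GT and marks low-IoU anchors as negatives.)"""
    scores = np.zeros(len(proposals))
    for gt in gt_slices:
        overlaps = np.array([iou(p, gt) for p in proposals])
        scores[overlaps > pos_thresh] = 1
    return scores
```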

@Xiangyu-CAS
Author

@thisismohitgupta From your description it seems right, and that's what py-faster-rcnn does, except for dividing the variable-width GT into fixed-width (16-px) GT.

@thisismohitgupta

@Xiangyu-CAS You were right, MS-COCO Text won't work here because more refined word-level annotations are needed. That's why I couldn't get past 50% accuracy in the score predictions, which is even worse than random, so I am trying your suggested MLT dataset.

@athmey

athmey commented Jun 19, 2017

Hi, tianzhi,

May I have your email address?

I have something I'd like to discuss with you in detail.

Sincerely waiting for your response.

Thank you

@AnshulBasia

AnshulBasia commented Jul 27, 2017

@thisismohitgupta @Xiangyu-CAS For training purposes, can the proposals produced by the AnchorText class (locate_anchors) be used? If not, am I supposed to take the image, feed it to py-faster-rcnn, get those RPN proposals, and label them by IoU with the GT? Would that be my ground truth for training?

@tonmoyborah

Hi @Xiangyu-CAS. What if I want to group the words of a sentence together? If there is extra space, this CTPN implementation tends to separate the sections. Even when I trained it on a custom dataset, CTPN predicts separate boxes for words that are a few spaces apart. Is there a parameter I can tune to avoid this, i.e. to group words on a straight line together into one box? Thanks a lot for the help!

@Xiangyu-CAS
Author

Xiangyu-CAS commented Mar 31, 2018

@tonmoyborah The best way is to re-train this model with annotations labeled at the sentence level rather than the word level. What you get from the NN depends heavily on what the GT looks like. If your GT is labeled at the sentence level, then you get sentence-level predictions.
