
Fine Tuning Leads to Segmentation Issue #2132

Open
jaddoughman opened this issue Dec 24, 2018 · 54 comments

@jaddoughman

Environment

  • Tesseract Version:
    tesseract 4.0.0
    leptonica-1.77.0
    libjpeg 9c : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11
    Found AVX2
    Found AVX
    Found SSE

  • Platform:
    Ubuntu 16.04

Current Behavior:

I wanted to OCR a large dataset of Arabic newspapers with difficult delimiters and spacing. After running your original pre-trained model, I managed to recall about 80% of the required data. I opted to fine-tune your existing ara.traineddata file, using text lines as my training and test datasets. I used the "OCR-d Train" tool on GitHub to generate the necessary .box files.

Throughout the fine-tuning process, the Eval error percentages decreased tremendously, which suggests the model was trained successfully. I re-evaluated with my own method and confirmed the successful training.

However, the test dataset was made up of text lines, so both your evaluation and mine were generated at the text-line level. The issue occurred when I ran the fine-tuned model on a complete newspaper sample (composed of the same text-line fonts): the accuracy decreased significantly compared to your original pre-trained model. This made no sense at all. My fine-tuned model has better accuracy than yours at the text-line level, but when running it on a complete newspaper (composed of the same fonts), your pre-trained model performs better than my successfully fine-tuned model.

The issue seems to be connected to your segmentation algorithm. This is a major problem, since it means that your training tool only works at the text-line level and cannot be applied to any other form of dynamic text extraction. You will find below a sample newspaper, my fine-tuned model, and the learning curve from the training process.

Sample Newspaper:
Sample Newspaper.zip

Fine Tuned Model:
ara_finetuned.traineddata.zip

Learning Curve:
Learning Curve (60k Iterations).pdf

@jaddoughman
Author

Any idea what might be causing this issue?

@amitdo @Shreeshrii

@stweil
Contributor

stweil commented Dec 29, 2018

Here is a visualisation (using https://github.com/kba/hocrjs) for both results:

The layout recognition is clearly different.

@jaddoughman
Author

jaddoughman commented Dec 29, 2018

What would be the reason behind the different layouts? Why would my fine-tuning have an impact?

Also, thank you for your support @stweil

@stweil stweil pinned this issue Dec 29, 2018
@Shreeshrii
Collaborator

Shreeshrii commented Dec 29, 2018

Based on the recommendations for the tesstutorial in the wiki by @theraysmith, fine-tuning should only be done for a limited number of iterations. He suggested 400 iterations for fine-tuning for 'impact' and 3000 for fine-tuning to add a character. So 60,000 is probably too large.

Also, please check the --psm being used for training by the ocr-d/train script. Ray has mentioned in the LSTM training notes that the models were trained per line.
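For reference, the iteration cap mentioned above is set with lstmtraining's --max_iterations flag. A hedged sketch that only assembles such a finetuning command (all paths are illustrative placeholders, not the files from this thread):

```python
# Sketch: assemble an lstmtraining finetuning command capped at a small
# iteration count, per the recommendation above. All paths are placeholders.
def finetune_cmd(max_iterations=400):
    return [
        "lstmtraining",
        "--continue_from", "ara.lstm",            # checkpoint extracted from ara.traineddata
        "--traineddata", "ara/ara.traineddata",
        "--model_output", "output/ara_finetuned",
        "--train_listfile", "ara.training_files.txt",
        "--max_iterations", str(max_iterations),  # e.g. 400 for impact-style finetuning
    ]

print(" ".join(finetune_cmd()))
```

Run the returned argv with subprocess if the training tools are installed; building the list first makes the chosen settings easy to inspect.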

@Shreeshrii
Collaborator

https://github.com/tesseract-ocr/tesseract/wiki/NeuralNetsInTesseract4.00#integration-with-tesseract

Integration with Tesseract

The Tesseract 4.00 neural network subsystem is integrated into Tesseract as a line recognizer. It can be used with the existing layout analysis to recognize text within a large document, or it can be used in conjunction with an external text detector to recognize text from an image of a single textline.

The neural network engine is the default for 4.00. To recognize text from an image of a single text line, use SetPageSegMode(PSM_RAW_LINE). This can be used from the command line with -psm 13.

The neural network engine has been integrated to enable the multi-language mode that worked with Tesseract 3.04, but this will be improved in a future release. Vertical text is now supported for Chinese, Japanese and Korean, and should be detected automatically.

@stweil Does this mean that layout analysis has changed since tessdata_best was trained?
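The single-line mode quoted above can be exercised from the command line. A minimal sketch follows; the file name and language code are illustrative, and the command is only built here, not executed:

```python
# Sketch: build the tesseract 4.x argv for recognizing one text-line image
# with PSM 13 (PSM_RAW_LINE), as described in the wiki passage above.
# "line_0001.png" and the "ara" language code are placeholder values.
def line_ocr_cmd(image_path, lang="ara", psm=13):
    return [
        "tesseract", image_path, "stdout",
        "-l", lang,          # which traineddata to use (e.g. a finetuned model)
        "--psm", str(psm),   # 13 = raw line: bypass layout analysis entirely
    ]

print(" ".join(line_ocr_cmd("line_0001.png")))
# tesseract line_0001.png stdout -l ara --psm 13
```

Comparing this per-line invocation against a default run on the full page (no --psm flag, i.e. full layout analysis) is one way to isolate segmentation effects from recognition effects.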

@jaddoughman
Author

As shown in the learning curve uploaded above, the training process was successful (even at 60k iterations): the accuracy improved at the text-line level. My issue, as explained above and shown in the layout representations, is segmentation. When I run the trained model on a complete newspaper, the accuracy goes way off.

Have a look at the layout representations above. I used --psm 7 for training.

@Shreeshrii

@stweil
Contributor

stweil commented Dec 29, 2018

Does this mean that layout analysis has changed since tessdata_best was trained?

Why do you think so?

@stweil
Contributor

stweil commented Dec 30, 2018

@jaddoughman, is this result better?

I added ara.config which was missing in ara_finetuned.traineddata.

@stweil
Contributor

stweil commented Dec 30, 2018

There are some more components which could be taken from the original ara.traineddata:

ara.lstm-number-dawg
ara.lstm-punc-dawg
ara.lstm-recoder
ara.lstm-unicharset
ara.lstm-word-dawg

@jaddoughman
Author

No, even after adding the dawg files the issue remains. I can't see how training a model is in any way connected to the segmentation process. The layout representation should be identical across all models, or am I wrong?

@stweil

@stweil
Contributor

stweil commented Dec 30, 2018

I would have thought so too, but recently I noticed some cases which are even stranger:

  • newspaper pages full of text where Tesseract says "empty page"
  • hOCR with no text while the text output shows text

ara.config changes the segmentation process, so this is something which needs to be added to the documentation: add the config from the original traineddata to the fine-tuned traineddata.
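The fix described here (carrying ara.config over from the original traineddata) can be scripted with combine_tessdata. A hedged sketch that only assembles the commands; the file names are illustrative, and the tool is assumed to be the one shipped with Tesseract's training utilities:

```python
# Sketch: commands to copy components such as ara.config from the stock
# ara.traineddata into a finetuned traineddata. combine_tessdata -e extracts
# a component to a file; -o overwrites that component in the target.
def restore_cmds(src="ara.traineddata", dst="ara_finetuned.traineddata",
                 components=("config",)):
    cmds = []
    for comp in components:
        part = f"ara.{comp}"
        cmds.append(["combine_tessdata", "-e", src, part])  # extract from original
        cmds.append(["combine_tessdata", "-o", dst, part])  # inject into finetuned
    return cmds

for cmd in restore_cmds():
    print(" ".join(cmd))
    # run with subprocess.run(cmd, check=True) if the training tools are installed
```

The same helper covers the dawg/recoder components listed later in this thread by passing them in `components`.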

@jaddoughman
Author

jaddoughman commented Dec 30, 2018

I trained twice, once including the dawg files and once excluding them. The training which included the dawg files was better than the one excluding them. However, both were far worse than the original model.

Also, note that training was successful (learning curve attached above). At the text-line level, the results are near perfect. However, I need the transcription of the complete newspaper sample. This is part of a 12-month research project; reaching this issue now is devastating.

On a technical level, there needs to be an explanation of why and how training a model would in any way alter the segmentation process.

@stweil

@Shreeshrii
Collaborator

Shreeshrii commented Dec 30, 2018

@jaddoughman Which psm are you using for the complete newspaper sample? If it is the default, i.e. psm 3, then please try the training with --psm 3 (or without specifying the psm) as an experiment and see if the results are better.

@jaddoughman
Author

I attached @Shreeshrii's fine-tuned Arabic model below. @stweil, is it possible to generate its corresponding layout representation? This could help us reach a conclusive decision on our initial assumption concerning the segmentation issue.

Fine Tuned Model: ara-amiri-3000.traineddata.zip

@stweil
Contributor

stweil commented Dec 31, 2018

Here it is:

@jaddoughman
Author

Original Model: https://ub-blade-01.bib.uni-mannheim.de/~stweil/tesseract/issues/2132/Sample1.html
@Shreeshrii 's Fine Tuned Model: https://ub-blade-01.bib.uni-mannheim.de/~stweil/tesseract/issues/2132/Sample1-ara-amiri-3000.html

The above results confirm that our assumption concerning segmentation is true. Any explanation of the relation between fine-tuning and word detection (segmentation) would be greatly appreciated. Understanding the problem can help in finding a workaround.

@stweil @theraysmith

@amitdo
Collaborator

amitdo commented Dec 31, 2018

The layout analysis phase detects:

  • tables
  • columns
  • blocks (text / image)
  • text lines

Word and glyph splitting is part of the OCR phase, not the layout analysis phase.

@jaddoughman
Author

jaddoughman commented Dec 31, 2018

Why is fine-tuning changing the word recognition? How can I fix my issue?

Also, if word splitting occurs in the OCR phase, then why do I get different results when running the exact same line within a complete newspaper versus as a standalone text line? Meaning: if I OCR a single text line, I get a different result than when OCRing a complete newspaper containing that text line.

@amitdo

@amitdo
Collaborator

amitdo commented Dec 31, 2018

Looking again at the code, it seems that word splitting does occur in the layout analysis phase...

I think the word splitting can still be changed by the OCR phase.

Sorry, I don't have answers to your last questions.

@jaddoughman
Author

Can the code be altered to perform splitting in the OCR phase? I see no reason why word splitting should be altered during training. My training dataset consisted of 4000 text lines that required crowdsourcing to generate. A lot of time was invested in training the model. Any help would be greatly appreciated.

If any of the other developers have an answer I would be happy to try any alternative fix.

@amitdo

@amitdo
Collaborator

amitdo commented Dec 31, 2018

Sorry, I don't know how to help you with this issue.

@Shreeshrii
Collaborator

@jaddoughman I unpacked your traineddata file with combine_tessdata. The lstm_unicharset in it has 303 characters, so it seems to me that you trained using script/Arabic from tessdata_best rather than ara.traineddata.
If that is indeed the case, please try starting from ara.traineddata to see if there is any difference.

Also, please share the exact version of tesseract that you are using. Your traineddata file reported beta.3.
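The check described above can be automated: a unicharset file (including the lstm-unicharset unpacked by combine_tessdata) starts with the number of entries on its first line. A small sketch; the file name is illustrative:

```python
# Sketch: read the entry count from the first line of an (lstm-)unicharset
# file, e.g. after `combine_tessdata -u model.traineddata model.` has unpacked
# the components. A count of 303 here is what matched script/Arabic above.
def unicharset_size(path):
    with open(path, encoding="utf-8") as f:
        return int(f.readline().strip())

# unicharset_size("ara_finetuned.lstm-unicharset")
```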

@jaddoughman
Author

I trained using both script/Arabic and ara.traineddata from tessdata_best. Both altered the segmentation, leading to the same issue. I was using Tesseract 4.0 during training. However, even if another version had been used, I also tried your fine-tuned model, which likewise resulted in altered word detection.

Is it possible to alter the code so that word splitting resides in the layout phase and not the OCR one?

@Shreeshrii

@jaddoughman
Author

jaddoughman commented Dec 31, 2018

If I attach my training dataset, could you fine-tune with it, to ensure that the issue isn't related to my training process?

@Shreeshrii

@Shreeshrii
Collaborator

@theraysmith is the only one with enough knowledge of the code to suggest a solution, and according to @jbreiden he is now busy with another project at Google.
If you can share your box/tiff pairs, I can give fine-tuning a try with them. However, I have to admit that I have not had much success adding the Arabic-script digits to the traineddata by fine-tuning.

@jaddoughman
Author

The dataset below contains about 4000 text lines. The txt files are in RTL order. I was informed that they needed to be changed to LTR, and I attempted that by reversing the string in every text file. The dataset below is still in RTL, since one of my issues might have been caused by my conversion attempt. I fine-tuned for 60,000 iterations and saw a great improvement in accuracy at the text-line level. I think your training attempt can help us reach a conclusive decision on the origin of the issue.

Thank you for your help.

Dataset: dataset.zip
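For clarity, the RTL-to-LTR conversion attempt mentioned above amounts to reversing each line. A sketch of that naive approach, with the caveat that codepoint-level reversal scrambles combining marks and embedded LTR runs such as digits, so it may itself introduce training errors:

```python
# Sketch of the naive per-line reversal described above. This reverses raw
# codepoints; Arabic combining marks (harakat) and embedded digit runs end up
# in the wrong order, which proper bidi handling would avoid.
def reverse_lines(text):
    return "\n".join(line[::-1] for line in text.split("\n"))

print(reverse_lines("abc 123"))  # 321 cba
```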

@Shreeshrii

@Shreeshrii
Collaborator

According to posts by Ray, training for all languages is done in LTR order, and there is a routine in Tesseract to handle the change to RTL later.

I do not know Arabic, hence I cannot check whether the conversion is correct. I am relying on text2image to create the correct box files.

I have concatenated your text files to create a training_text for fine tuning. I will run the training with Scheherazade font and share the results.

From my earlier experience, fine-tuning seems to work best when the training text used is the same as what was used for the initial training. For Arabic we do not have that file available; we only have the 80-line training_text (similar to 3.04).

@Shreeshrii
Collaborator

@stweil

Does this mean that layout analysis has changed since tessdata_best was trained?

Why do you think so?

No concrete proof :-(

There have been issues with page segmentation and word dropping for a while. There are probably a number of related issues still open.

So, something has definitely changed.

If it is not seen in eng, deu and other Latin-script-based languages, then it may be related to complex-script processing / unichar compression / recoding.

Were you able to get the unit tests related to unichar compression to work? Maybe they can help in figuring out the issue.

@Shreeshrii
Collaborator

Shreeshrii commented Jan 1, 2019

Please see Ray's comments in #648 (comment); these are from Jan 2017. He made changes to the processing for Arabic after that. I will try to find those comments and commits and link them here for reference.

@Shreeshrii
Collaborator

Shreeshrii commented Jan 1, 2019

3e63918

2017-09-08 (3e63918) Ray Smith: Fixed order of characters in ligatures of RTL languages issue #648

@Shreeshrii
Collaborator

4e8018d

2017-07-19 (4e8018d) Ray Smith: Important fix to RTL languages saves last space on each line, which was previously lost

@jaddoughman
Author

Okay, did your fine-tuning attempt with the given dataset work? Your attempt is important, since my extracted traineddata file reported beta version 3.

@Shreeshrii

Shreeshrii added a commit to Shreeshrii/tessdata_arabic that referenced this issue Jan 1, 2019
400 iterations - Scheherazade font - training text made by concatenating text lines from dataset provided in tesseract-ocr/tesseract#2132 (comment)
Shreeshrii added a commit to Shreeshrii/tessdata_arabic that referenced this issue Jan 1, 2019
PlusMinus training using training text based on dataset in tesseract-ocr/tesseract#2132 (comment)
at 400, 4000 and 10000 iterations
@Shreeshrii
Collaborator

Shreeshrii commented Jan 1, 2019

@jaddoughman Please see https://github.com/Shreeshrii/tessdata_arabic

I have uploaded there various versions of fine-tuned traineddata using training text based on your dataset.
I have not used the scanned images.

If you know the font used for the newspaper, or a similar font, fine-tuning with that might give better results.

@jaddoughman
Author

I ran all your trained models on 5 test samples, but the accuracy decreased on every one. The issue is still caused by word detection, since a fine tuned model would never perform worse than the original one. This is unfortunate. If any possible explanation arises concerning the connection between training and segmentation, please let me know.

Thank you for your help. @Shreeshrii

@jaddoughman
Author

jaddoughman commented Jan 17, 2019

Our fine-tuned model performs better at the text-line level, so training is improving accuracy there. One possible solution I'm exploring is to segment the newspaper samples into text lines and OCR those with our fine-tuned model. The catch is that I would need a segmentation algorithm to automate this process.

I created a tool that parses Tesseract's hOCR files into bboxes and generates the corresponding text-line images; however, the segmentation was far from perfect. Do you recommend any other way to automatically segment the newspaper samples into text lines? Or word extraction? I just need the segmented text lines, which can then be transcribed with our fine-tuned model.

SAMPLE NEWSPAPER: Sample1.tif.zip
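The hOCR-based line extraction described above can be sketched as follows. This is an illustrative regex approach, assuming hOCR as emitted by Tesseract (bbox coordinates inside each ocr_line element's title attribute); the actual cropping is left to an imaging library such as Pillow:

```python
import re

# Sketch: collect the bounding box of every ocr_line element in an hOCR file.
# Each tuple is (x1, y1, x2, y2) in image pixels; cropping these regions from
# the page image yields the per-line images to feed the finetuned model.
LINE_RE = re.compile(
    r"class=['\"]ocr_line['\"][^>]*title=['\"][^'\"]*bbox (\d+) (\d+) (\d+) (\d+)")

def line_bboxes(hocr_text):
    return [tuple(map(int, m.groups())) for m in LINE_RE.finditer(hocr_text)]

sample = "<span class='ocr_line' id='line_1_1' title=\"bbox 10 20 300 45; baseline 0 -3\">"
print(line_bboxes(sample))  # [(10, 20, 300, 45)]
```

A proper HTML parser would be more robust than a regex for production use, but this is enough to experiment with hOCR output from different models.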

@Shreeshrii @amitdo @stweil

@Shreeshrii
Collaborator

As an experiment, try creating the hOCR files using different language traineddata and see if the boxing is better.

Also try with --oem 0, i.e. base Tesseract instead of LSTM Tesseract, and with older versions of Tesseract (3.05, 3.04).

It would be good to know whether segmentation is different in all these cases and whether any are better for your use case.

You can also use Leptonica directly for segmentation. Please look at the sample programs provided with it; I recall one which had good results for Arabic.

@Shreeshrii
Collaborator

Please also see tesseract-ocr/tesstrain#7

@jaddoughman
Author

I uploaded below the text-line images generated by the Arabic, Arabic fine-tuned, and English models (using their respective hOCR files). The English and Arabic text-line images differ, probably due to writing orientation (RTL vs. LTR), but ara and ara_finetuned gave the same results. This is what I predicted, but it doesn't lead me anywhere: we already knew that fine-tuning doesn't change anything at the text-line level; it is the recognition of words that differs.

ENGLISH MODEL: Sample1_eng.zip
ARABIC MODEL: Sample1_ara.zip
ARABIC FINE TUNED MODEL: Sample1_ara_finetuned.zip

@Shreeshrii

@Shreeshrii
Collaborator

I created a tool that parses Tesseract's hOCR files into bboxes and generates the corresponding text-line images; however, the segmentation was far from perfect.

My reasoning for the experiment was that if another model gives you better segmentation, you can use it for splitting into line images and then use your fine-tuned model for OCR.

@Shreeshrii
Collaborator

Also see #657

@jaddoughman
Author

I tried all variations of different language models and OEMs. No major difference was found. I think the most reasonable solution would be to use Leptonica. However, isn't Tesseract powered by Leptonica? If so, can it generate results different from the hOCR files generated by Tesseract?

@Shreeshrii

@amitdo
Collaborator

amitdo commented Jan 17, 2019

#657

@Shreeshrii
Collaborator

I just used leptonica/prog/arabic_lines and changed the input file name to arabic.png for testing.

The complete newspaper did not work well with it, so I cropped a section with two columns.

ubuntu@tesseract-ocr:~/leptonica/prog$ ./arabic_lines
Info in pixRotate: 1 bpp; rotate by shear
Skew angle:   -0.25 degrees;   7.80 conf
Num columns: 2
Num textlines in col 0: 59
Num textlines in col 1: 57
sh: 1: xzgv: not found

Results for that are attached.

arabic.zip

@jaddoughman
Author

Thank you for your help. However, isn't Tesseract using the arabic_lines code to segment the input image? If not, what code are you using?

@Shreeshrii

@Shreeshrii
Collaborator

isn't Tesseract using the arabic_lines code to segment the inputted image ?

No. Tesseract has its own layout analysis code, which may use other Leptonica functions.

@jaddoughman
Author

Will you be fixing the issue of fine-tuning leading to altered word detection in the upcoming Tesseract 4.1 release? I believe this is a major obstacle, especially for Arabic, since the pre-trained models perform very badly. Even after you trained using a separate training dataset, the word detection was altered and the accuracy decreased substantially.

If you have any immediate fix, or can point me in a direction that fixes this issue, please let me know. I have 185,000 images similar to the ones attached, and my trained model suffers from the bug discussed above. Thank you for your help.

@Shreeshrii

@Shreeshrii
Collaborator

The official traineddata was trained by Ray Smith at Google. As far as I know, there are no new updates planned.

I try to follow the guidelines given by Ray in the tesstutorial or in his comments on issues when experimenting with training.

Regarding layout analysis, there are other similar open issues. I am not sure whether there are any plans to address them for 4.1.0.

You can try posting in the tesseract-ocr Google group to see if someone has had better luck improving the Arabic traineddata.

@Shreeshrii
Collaborator

Shreeshrii commented Mar 10, 2019 via email

@jaddoughman
Author

jaddoughman commented Mar 10, 2019

The font family can be found easily, since the images are from a well-known newspaper which uses consistent font families throughout its archive. However, the bigger issue is the altered word detection post-training. You attempted to train on a certain font family, and the results were worse than the pre-trained model. My question is: how can fine-tuning a model decrease the accuracy? And how is fine-tuning altering the detection of the words themselves?

The OCR Process as you know has 4 main steps:

  1. Binarization
  2. Segmentation
  3. Classification
  4. Post-Processing

Word detection occurs prior to the classification of the letters themselves. The layout analysis attached above shows altered and incorrect word detection for the trained model. These questions should be addressed in your updated version of Tesseract. OCRing the 185K archive is part of a research paper; the months invested in training Tesseract shouldn't go to waste. I have a lot of samples if you wish to experiment.

@Shreeshrii @amitdo @stweil

@Shreeshrii
Collaborator

Google/Ray have not shared the training text used for LSTM training for Arabic, so we only have the 80 lines from the langdata repo. Fine-tuning works best, AFAIK, when the original training text is used with minimal changes. Trying a different text leads to worse results, as you have pointed out.

@zdenop
Contributor

zdenop commented Mar 18, 2019

@jaddoughman: As far as I understand, the Cognitive Services Arabic OCR API is part of Microsoft Computer Vision, which is an alternative to Cloud Vision, not to Tesseract. These kinds of services are neither free nor open source.

@amitdo
Collaborator

amitdo commented Jun 22, 2020

The issue is still caused by word detection, since a fine tuned model would never perform worse than the original one.

Your assumption is wrong.

#2132 (comment)

As Shree pointed out, you should not train on too many lines with the same font. It will lead to overfitting.

@amitdo amitdo unpinned this issue Jun 22, 2020