
Fine Tuning Leads to Segmentation Issue #2132

Open
jaddoughman opened this issue Dec 24, 2018 · 54 comments

@jaddoughman

Environment

  • Tesseract Version:
    tesseract 4.0.0
    leptonica-1.77.0
    libjpeg 9c : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11
    Found AVX2
    Found AVX
    Found SSE

  • Platform:
    Ubuntu 16.04

Current Behavior:

I wanted to OCR a large dataset of Arabic newspapers with difficult delimiters and spacing. After running your original pre-trained model, I managed to recall about 80% of the required data. I opted to fine-tune your existing ara.traineddata file, using text lines as my training and test datasets. I used the "OCR-d Train" tool on GitHub to generate the necessary .box files.

Throughout the fine-tuning process, the Eval error percentages decreased tremendously, which suggests the model was trained successfully. I re-evaluated with my own method and confirmed the successful training.

However, the test dataset was made up of text lines, so both your evaluation and mine were generated at the text-line level. The issue occurred when I ran the fine-tuned model on a complete newspaper sample (composed of the same text-line fonts): the accuracy decreased significantly compared to your original pre-trained model. This made no sense at all. My fine-tuned model has better accuracy than yours at the text-line level, but when running it on a complete newspaper (composed of the same fonts), your pre-trained model performs better than my successfully fine-tuned model.

The issue seems to be connected to your segmentation algorithm. This is a major problem, since it means that your training tool only works at the text-line level and cannot be applied to any other form of dynamic text extraction. You will find below a sample newspaper, my fine-tuned model, and the learning curve from the training process.

Sample Newspaper:
Sample Newspaper.zip

Fine Tuned Model:
ara_finetuned.traineddata.zip

Learning Curve:
Learning Curve (60k Iterations).pdf

@jaddoughman
Author

Any idea what might be causing this issue?

@amitdo @Shreeshrii

@stweil
Contributor

stweil commented Dec 29, 2018

Here is a visualisation (using https://github.com/kba/hocrjs) for both results:

The layout recognition is clearly different.

@jaddoughman
Author

jaddoughman commented Dec 29, 2018

What would be the reason behind the different layouts? Why would my fine-tuning have an impact?

Also, thank you for your support @stweil

@stweil stweil pinned this issue Dec 29, 2018
@Shreeshrii
Collaborator

Shreeshrii commented Dec 29, 2018

Based on the recommendations for the tesstutorial in the wiki by @theraysmith, fine-tuning should only be done for a limited number of iterations. He suggested 400 iterations for fine-tuning for 'impact' and 3000 for fine-tuning to add a character. So 60,000 is probably too large.

Also, please check the --psm being used for training by the ocr-d/train script. Ray has mentioned in the LSTM training notes that the models were trained per line.
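For reference, the iteration cap mentioned above is set with lstmtraining's --max_iterations flag. A hedged sketch that only assembles such a finetuning command (all paths are illustrative placeholders, not the files from this thread):

```python
# Sketch: assemble an lstmtraining finetuning command capped at a small
# iteration count, per the recommendation above. All paths are placeholders.
def finetune_cmd(max_iterations=400):
    return [
        "lstmtraining",
        "--continue_from", "ara.lstm",            # checkpoint extracted from ara.traineddata
        "--traineddata", "ara/ara.traineddata",
        "--model_output", "output/ara_finetuned",
        "--train_listfile", "ara.training_files.txt",
        "--max_iterations", str(max_iterations),  # e.g. 400 for impact-style finetuning
    ]

print(" ".join(finetune_cmd()))
```

Run the returned argv with subprocess if the training tools are installed; building the list first makes the chosen settings easy to inspect.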

@Shreeshrii
Collaborator

https://github.com/tesseract-ocr/tesseract/wiki/NeuralNetsInTesseract4.00#integration-with-tesseract

Integration with Tesseract

The Tesseract 4.00 neural network subsystem is integrated into Tesseract as a line recognizer. It can be used with the existing layout analysis to recognize text within a large document, or it can be used in conjunction with an external text detector to recognize text from an image of a single textline.

The neural network engine is the default for 4.00. To recognize text from an image of a single text line, use SetPageSegMode(PSM_RAW_LINE). This can be used from the command line with -psm 13.

The neural network engine has been integrated to enable the multi-language mode that worked with Tesseract 3.04, but this will be improved in a future release. Vertical text is now supported for Chinese, Japanese and Korean, and should be detected automatically.

@stweil Does this mean that layout analysis has changed since tessdata_best was trained?
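The single-line mode quoted above can be exercised from the command line. A minimal sketch follows; the file name and language code are illustrative, and the command is only built here, not executed:

```python
# Sketch: build the tesseract 4.x argv for recognizing one text-line image
# with PSM 13 (PSM_RAW_LINE), as described in the wiki passage above.
# "line_0001.png" and the "ara" language code are placeholder values.
def line_ocr_cmd(image_path, lang="ara", psm=13):
    return [
        "tesseract", image_path, "stdout",
        "-l", lang,          # which traineddata to use (e.g. a finetuned model)
        "--psm", str(psm),   # 13 = raw line: bypass layout analysis entirely
    ]

print(" ".join(line_ocr_cmd("line_0001.png")))
# tesseract line_0001.png stdout -l ara --psm 13
```

Comparing this per-line invocation against a default run on the full page (no --psm flag, i.e. full layout analysis) is one way to isolate segmentation effects from recognition effects.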

@jaddoughman
Author

As shown in the learning curve uploaded above, the training process was successful (even at 60k iterations): the accuracy improved at the text-line level. My issue, as explained above and shown in the layout representations, is segmentation. When I run the trained model on a complete newspaper, the accuracy goes way off.

Have a look at the layout representations above. I used --psm 7 for training.

@Shreeshrii

@stweil
Contributor

stweil commented Dec 29, 2018

Does this mean that layout analysis has changed since tessdata_best was trained?

Why do you think so?

@stweil
Contributor

stweil commented Dec 30, 2018

@jaddoughman, is this result better?

I added ara.config which was missing in ara_finetuned.traineddata.

@stweil
Contributor

stweil commented Dec 30, 2018

There are some more components which could be taken from the original ara.traineddata:

ara.lstm-number-dawg
ara.lstm-punc-dawg
ara.lstm-recoder
ara.lstm-unicharset
ara.lstm-word-dawg

@jaddoughman
Author

No, even after adding the dawg files the issue remains. I can't see how training a model is in any way connected to the segmentation process. The layout representation should be identical across all models, or am I wrong?

@stweil

@stweil
Contributor

stweil commented Dec 30, 2018

I would have thought so too, but recently I noticed some cases which are even stranger:

  • newspaper pages full of text where Tesseract says "empty page"
  • hOCR with no text while the text output shows text

ara.config changes the segmentation process, so this is something which needs to be added to the documentation: add the config from the original traineddata to the fine-tuned traineddata.
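The fix described here (carrying ara.config over from the original traineddata) can be scripted with combine_tessdata. A hedged sketch that only assembles the commands; the file names are illustrative, and the tool is assumed to be the one shipped with Tesseract's training utilities:

```python
# Sketch: commands to copy components such as ara.config from the stock
# ara.traineddata into a finetuned traineddata. combine_tessdata -e extracts
# a component to a file; -o overwrites that component in the target.
def restore_cmds(src="ara.traineddata", dst="ara_finetuned.traineddata",
                 components=("config",)):
    cmds = []
    for comp in components:
        part = f"ara.{comp}"
        cmds.append(["combine_tessdata", "-e", src, part])  # extract from original
        cmds.append(["combine_tessdata", "-o", dst, part])  # inject into finetuned
    return cmds

for cmd in restore_cmds():
    print(" ".join(cmd))
    # run with subprocess.run(cmd, check=True) if the training tools are installed
```

The same helper covers the dawg/recoder components listed later in this thread by passing them in `components`.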

@jaddoughman
Author

jaddoughman commented Dec 30, 2018

I trained twice, once including the dawg files and once excluding them. The training which included the dawg files was better than the one excluding them. However, both were far worse than the original model.

Also, note that training was successful (learning curve attached above). At the text-line level, the results are near perfect. However, I need the transcription of the complete newspaper sample. This is part of a 12-month research project; reaching this issue now is devastating.

On a technical level, there needs to be an explanation of why and how training a model would in any way alter the segmentation process.

@stweil

@Shreeshrii
Collaborator

Shreeshrii commented Dec 30, 2018

@jaddoughman Which psm are you using for the complete newspaper sample? If it is the default, i.e. psm 3, then please try the training with --psm 3 (or without specifying the psm) as an experiment and see if the results are better.

@jaddoughman
Author

I attached @Shreeshrii's fine-tuned Arabic model below. @stweil, is it possible to generate its corresponding layout representation? This could help us reach a conclusive decision on our initial assumption concerning the segmentation issue.

Fine Tuned Model: ara-amiri-3000.traineddata.zip

@stweil
Contributor

stweil commented Dec 31, 2018

Here it is:

@jaddoughman
Author

Original Model: https://ub-blade-01.bib.uni-mannheim.de/~stweil/tesseract/issues/2132/Sample1.html
@Shreeshrii 's Fine Tuned Model: https://ub-blade-01.bib.uni-mannheim.de/~stweil/tesseract/issues/2132/Sample1-ara-amiri-3000.html

The above results confirm that our assumption concerning segmentation is true. Any explanation of the relation between fine-tuning and word detection (segmentation) would be greatly appreciated. Understanding the problem can help in finding a workaround.

@stweil @theraysmith

@amitdo
Collaborator

amitdo commented Dec 31, 2018

The layout analysis phase detects:

  • tables
  • columns
  • blocks (text / image)
  • text lines

Word and glyph splitting is part of the OCR phase, not the layout analysis phase.

@jaddoughman
Author

jaddoughman commented Dec 31, 2018

Why is fine-tuning changing the word recognition? How can I fix my issue?

Also, if word splitting occurs in the OCR phase, then why do I get different results when running the exact same line within a complete newspaper versus as a standalone text line? Meaning: if I OCR a single text line, I get a different result than when OCRing a complete newspaper containing that text line.

@amitdo

@amitdo
Collaborator

amitdo commented Dec 31, 2018

Looking again at the code, it seems that word splitting does occur in the layout analysis phase...

I think the word splitting can still be changed by the OCR phase.

Sorry, I don't have answers to your last questions.

@jaddoughman
Author

Can the code be altered to perform splitting in the OCR phase? I see no reason why word splitting should be altered during training. My training dataset consisted of 4000 text lines that required crowdsourcing to generate. A lot of time was invested in training the model. Any help would be greatly appreciated.

If any of the other developers have an answer I would be happy to try any alternative fix.

@amitdo

@amitdo
Collaborator

amitdo commented Dec 31, 2018

Sorry, I don't know how to help you with this issue.

@Shreeshrii
Collaborator

@jaddoughman I unpacked your traineddata file with combine_tessdata. The lstm_unicharset in it has 303 characters, so it seems to me that you trained using script/Arabic from tessdata_best rather than ara.traineddata.
If that is indeed the case, please try starting from ara.traineddata to see if there is any difference.

Also, please share the exact version of tesseract that you are using. Your traineddata file reported beta.3.
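The check described above can be automated: a unicharset file (including the lstm-unicharset unpacked by combine_tessdata) starts with the number of entries on its first line. A small sketch; the file name is illustrative:

```python
# Sketch: read the entry count from the first line of an (lstm-)unicharset
# file, e.g. after `combine_tessdata -u model.traineddata model.` has unpacked
# the components. A count of 303 here is what matched script/Arabic above.
def unicharset_size(path):
    with open(path, encoding="utf-8") as f:
        return int(f.readline().strip())

# unicharset_size("ara_finetuned.lstm-unicharset")
```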

@jaddoughman
Author

I trained using both script/Arabic and ara.traineddata from tessdata_best. Both altered the segmentation, leading to the same issue. I was using Tesseract 4.0 during training. However, even if another version had been used, I also tried your fine-tuned model, which likewise resulted in altered word detection.

Is it possible to alter the code so that word splitting resides in the layout phase and not the OCR one?

@Shreeshrii

@jaddoughman
Author

jaddoughman commented Dec 31, 2018

If I attach my training dataset, could you fine-tune with it, to ensure that the issue isn't related to my training process?

@Shreeshrii

@Shreeshrii
Collaborator

@theraysmith is the only one with enough knowledge of the code to suggest a solution, and according to @jbreiden he is now busy with another project at Google.
If you can share your box/tiff pairs, I can give fine-tuning a try with them. However, I have to admit that I have not had much success adding the Arabic-script digits to the traineddata by fine-tuning.

@jaddoughman
Author

The dataset below contains about 4000 text lines. The txt files are in RTL order. I was informed that they needed to be changed to LTR, and I attempted that by reversing the string in every text file. The dataset below is still in RTL, since one of my issues might have been caused by my conversion attempt. I fine-tuned for 60,000 iterations and saw a great improvement in accuracy at the text-line level. I think your training attempt can help us reach a conclusive decision on the origin of the issue.

Thank you for your help.

Dataset: dataset.zip
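For clarity, the RTL-to-LTR conversion attempt mentioned above amounts to reversing each line. A sketch of that naive approach, with the caveat that codepoint-level reversal scrambles combining marks and embedded LTR runs such as digits, so it may itself introduce training errors:

```python
# Sketch of the naive per-line reversal described above. This reverses raw
# codepoints; Arabic combining marks (harakat) and embedded digit runs end up
# in the wrong order, which proper bidi handling would avoid.
def reverse_lines(text):
    return "\n".join(line[::-1] for line in text.split("\n"))

print(reverse_lines("abc 123"))  # 321 cba
```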

@Shreeshrii

@Shreeshrii
Collaborator

According to posts by Ray, training for all languages is done in LTR order, and there is a routine in Tesseract to handle the change to RTL later.

I do not know Arabic, hence I cannot check whether the conversion is correct. I am relying on text2image to create the correct box files.

I have concatenated your text files to create a training_text for fine tuning. I will run the training with Scheherazade font and share the results.

From my earlier experience, fine-tuning seems to work best when the training text used is the same as what was used for the initial training. For Arabic we do not have that file available; we only have the 80-line training_text (similar to 3.04).

@Shreeshrii
Collaborator

@stweil

Does this mean that layout analysis has changed since tessdata_best was trained?

Why do you think so?

No concrete proof :-(

There have been issues with page segmentation and word dropping for a while. There are probably a number of related issues still open.

So, something has definitely changed.

If it is not seen in eng, deu and other Latin-script-based languages, then it may be related to complex-script processing / unichar compression / recoding.

Were you able to get the unit tests related to unichar compression to work? Maybe they can help in figuring out the issue.

@Shreeshrii
Collaborator

Shreeshrii commented Jan 1, 2019

Please see Ray's comments in #648 (comment); these are from Jan 2017. He made changes to the processing for Arabic after that. I will try to find those comments and commits and link them here for reference.

@Shreeshrii
Collaborator

Shreeshrii commented Jan 1, 2019

3e63918

2017-09-08 (3e63918) Ray Smith: Fixed order of characters in ligatures of RTL languages issue #648

@Shreeshrii
Collaborator

4e8018d

2017-07-19 (4e8018d) Ray Smith: Important fix to RTL languages saves last space on each line, which was previously lost

@jaddoughman
Author

Okay, did your fine-tuning attempt with the given dataset work? Your attempt is important, since my extracted traineddata file reported beta version 3.

@Shreeshrii

Shreeshrii added a commit to Shreeshrii/tessdata_arabic that referenced this issue Jan 1, 2019
400 iterations - Scheherazade font - training text made by concatenating text lines from dataset provided in tesseract-ocr/tesseract#2132 (comment)
Shreeshrii added a commit to Shreeshrii/tessdata_arabic that referenced this issue Jan 1, 2019
PlusMinus training using training text based on dataset in tesseract-ocr/tesseract#2132 (comment)
at 400, 4000 and 10000 iterations
@Shreeshrii
Collaborator

Shreeshrii commented Jan 1, 2019

@jaddoughman Please see https://github.com/Shreeshrii/tessdata_arabic

I have uploaded there various versions of fine-tuned traineddata using training text based on your dataset.
I have not used the scanned images.

If you know the font used for the newspaper, or a similar font, fine-tuning with that might give better results.

@jaddoughman
Author

I ran all your trained models on 5 test samples, but the accuracy decreased on every one. The issue is still caused by word detection, since a fine tuned model would never perform worse than the original one. This is unfortunate. If any possible explanation arises concerning the connection between training and segmentation, please let me know.

Thank you for your help. @Shreeshrii

@jaddoughman
Author

jaddoughman commented Jan 17, 2019

Our fine-tuned model performs better at the text-line level, so training is improving accuracy there. One possible solution I'm exploring is to segment the newspaper samples into text lines and OCR those with our fine-tuned model. The catch is that I would need a segmentation algorithm to automate this process.

I created a tool that parses Tesseract's hOCR files into bboxes and generates the corresponding text-line images; however, the segmentation was far from perfect. Do you recommend any other way to automatically segment the newspaper samples into text lines? Or word extraction? I just need the segmented text lines, which can then be transcribed with our fine-tuned model.

SAMPLE NEWSPAPER: Sample1.tif.zip
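The hOCR-based line extraction described above can be sketched as follows. This is an illustrative regex approach, assuming hOCR as emitted by Tesseract (bbox coordinates inside each ocr_line element's title attribute); the actual cropping is left to an imaging library such as Pillow:

```python
import re

# Sketch: collect the bounding box of every ocr_line element in an hOCR file.
# Each tuple is (x1, y1, x2, y2) in image pixels; cropping these regions from
# the page image yields the per-line images to feed the finetuned model.
LINE_RE = re.compile(
    r"class=['\"]ocr_line['\"][^>]*title=['\"][^'\"]*bbox (\d+) (\d+) (\d+) (\d+)")

def line_bboxes(hocr_text):
    return [tuple(map(int, m.groups())) for m in LINE_RE.finditer(hocr_text)]

sample = "<span class='ocr_line' id='line_1_1' title=\"bbox 10 20 300 45; baseline 0 -3\">"
print(line_bboxes(sample))  # [(10, 20, 300, 45)]
```

A proper HTML parser would be more robust than a regex for production use, but this is enough to experiment with hOCR output from different models.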

@Shreeshrii @amitdo @stweil

@Shreeshrii
Collaborator

As an experiment, try creating the hOCR files using different language traineddata and see if the boxing is better.

Also try with --oem 0, i.e. base Tesseract instead of LSTM Tesseract, and with older versions of Tesseract (3.05, 3.04).

It would be good to know whether segmentation is different in all these cases and whether any are better for your use case.

You can also use Leptonica directly for segmentation. Please look at the sample programs provided with it; I recall one which had good results for Arabic.

@Shreeshrii
Collaborator

Please also see tesseract-ocr/tesstrain#7

@jaddoughman
Author

I uploaded below the text-line images generated by the Arabic, Arabic fine-tuned, and English models (using their respective hOCR files). The English and Arabic text-line images differ, probably due to writing orientation (RTL vs. LTR), but ara and ara_finetuned gave the same results. This is what I predicted, but it doesn't lead me anywhere: we already knew that fine-tuning doesn't change anything at the text-line level; it is the recognition of words that differs.

ENGLISH MODEL: Sample1_eng.zip
ARABIC MODEL: Sample1_ara.zip
ARABIC FINE TUNED MODEL: Sample1_ara_finetuned.zip

@Shreeshrii

@Shreeshrii
Collaborator

I created a tool that parses Tesseract's hOCR files into bboxes and generates the corresponding text-line images; however, the segmentation was far from perfect.

My reasoning for the experiment was that if another model gives you better segmentation, you can use it for splitting into line images and then use your fine-tuned model for OCR.

@Shreeshrii
Collaborator

Also see #657

@jaddoughman
Author

I tried all variations of different language models and OEMs. No major difference was found. I think the most reasonable solution would be to use Leptonica. However, isn't Tesseract powered by Leptonica? If so, can it generate results different from the hOCR files generated by Tesseract?

@Shreeshrii

@amitdo
Collaborator

amitdo commented Jan 17, 2019

#657

@Shreeshrii
Collaborator

I just used leptonica/prog/arabic_lines and changed the input file name to arabic.png for testing.

The complete newspaper did not work well with it, so I cropped a section with two columns.

ubuntu@tesseract-ocr:~/leptonica/prog$ ./arabic_lines
Info in pixRotate: 1 bpp; rotate by shear
Skew angle:   -0.25 degrees;   7.80 conf
Num columns: 2
Num textlines in col 0: 59
Num textlines in col 1: 57
sh: 1: xzgv: not found

Results for that are attached.

arabic.zip

@jaddoughman
Author

Thank you for your help. However, isn't Tesseract using the arabic_lines code to segment the input image? If not, what code are you using?

@Shreeshrii

@Shreeshrii
Collaborator

isn't Tesseract using the arabic_lines code to segment the inputted image ?

No. Tesseract has its own layout analysis code, which may use other Leptonica functions.

@jaddoughman
Author

Will you be fixing the issue of fine-tuning leading to altered word detection in the upcoming Tesseract 4.1 release? I believe this is a major obstacle, especially for Arabic, since the pre-trained models perform very badly. Even after you trained using a separate training dataset, the word detection was altered and the accuracy decreased substantially.

If you have any immediate fix, or can point me in a direction that fixes this issue, please let me know. I have 185,000 images similar to the ones attached, and my trained model suffers from the bug discussed above. Thank you for your help.

@Shreeshrii

@Shreeshrii
Collaborator

The official traineddata was trained by Ray Smith at Google. As far as I know, there are no new updates planned.

I try to follow the guidelines given by Ray in the tesstutorial or in his comments on issues when experimenting with training.

Regarding layout analysis, there are other similar open issues. I am not sure whether there are any plans to address them for 4.1.0.

You can try posting in the tesseract-ocr Google group to see if someone has had better luck improving the Arabic traineddata.

@Shreeshrii
Collaborator

Shreeshrii commented Mar 10, 2019 via email

@jaddoughman
Author

jaddoughman commented Mar 10, 2019

The font family can be found easily, since the images are from a well-known newspaper which uses consistent font families throughout its archive. However, the bigger issue is the altered word detection post-training. You attempted to train on a certain font family, and the results were worse than the pre-trained model. My question is: how can fine-tuning a model decrease the accuracy? And how is fine-tuning altering the detection of the words themselves?

The OCR Process as you know has 4 main steps:

  1. Binarization
  2. Segmentation
  3. Classification
  4. Post-Processing

Word detection occurs prior to the classification of the letters themselves. The layout analysis attached above shows altered and incorrect word detection for the trained model. These questions should be addressed in your updated version of Tesseract. OCRing the 185K archive is part of a research paper; the months invested in training Tesseract shouldn't go to waste. I have a lot of samples if you wish to experiment.

@Shreeshrii @amitdo @stweil

@Shreeshrii
Collaborator

Google/Ray have not shared the training text used for LSTM training for Arabic, so we only have the 80 lines from the langdata repo. Fine-tuning works best, AFAIK, when the original training text is used with minimal changes. Trying a different text leads to worse results, as you have pointed out.

@zdenop
Contributor

zdenop commented Mar 18, 2019

@jaddoughman: As far as I understand, the Cognitive Services Arabic OCR API is part of Microsoft Computer Vision, which is an alternative to Cloud Vision, not to Tesseract. These kinds of services are neither free nor open source.

@amitdo
Collaborator

amitdo commented Jun 22, 2020

The issue is still caused by word detection, since a fine tuned model would never perform worse than the original one.

Your assumption is wrong.

#2132 (comment)

As Shree pointed out, you should not train on too many lines with the same font. It will lead to overfitting.

@amitdo amitdo unpinned this issue Jun 22, 2020