Letters split in multiple parts #1778

lorenzob · 2018-07-12T17:45:18Z

Hi,
I have a small problem with some letters that are recognized as multiple letters.

tesseract 4.0.0-beta.3-56-g5fda
leptonica-1.76.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found SSE

This is a sample (I can reproduce the problem with this image and eng "_best", md5: 4be3f51b55c0074d8c6b1ee5b5100f95):

output is: 17AE4L4

The 4 is seen as three different letters. Maybe the shape of the 4 is not so common and this is creating the problem. This is from an MRZ code, but I'm seeing this in normal text too.

This is how tesseract sees the image (data is taken from the bounding boxes):

(The cyan and magenta lines at the top and bottom are only there to make the boundaries between letters easier to see; the colors have no special meaning.)

I'm wondering if there is anything I can do to fix this other than training a custom model on this font (it is part of an mrz, btw).

Even a small edit to the image, like cropping, makes the problem appear or disappear. The output for this other sample is : 17AESL

Are there any parameters like minimum box size, split threshold, something I can ask the iterator, etc. that might help? Or is everything part of the lstm?

In general it seems to be very sensitive to horizontal translation. I'm seeing this, less often, even with fine-tuned models.

I tried some data augmentation, varying the amount of white space on the left from 0 to 5px, but got confusing results (too little data to be sure whether it was good or bad). Do you think this may be a valid strategy?
Do you think training from scratch may help?
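
For reference, the left-padding augmentation mentioned above could be sketched roughly as follows. This is only an illustration, not the code actually used in the experiment: it assumes a grayscale image held as a plain list of pixel rows with 255 as white, and the names `pad_left` and `left_padding_variants` are made up for this example.

```python
from typing import List

# Assumption: a grayscale image is a list of rows, each row a list of
# pixel values in 0..255, where 255 is white background.
Image = List[List[int]]


def pad_left(image: Image, pixels: int, fill: int = 255) -> Image:
    """Return a copy of the image with `pixels` background columns added on the left."""
    if pixels < 0:
        raise ValueError("pixels must be non-negative")
    return [[fill] * pixels + list(row) for row in image]


def left_padding_variants(image: Image, max_pixels: int = 5) -> List[Image]:
    """Return one variant per padding width from 0 to max_pixels inclusive."""
    return [pad_left(image, n) for n in range(max_pixels + 1)]
```

With the default `max_pixels=5` this yields six variants per line image, so a small real-image set grows sixfold while the labels stay the same.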

Thanks for any advice.

Shreeshrii · 2018-07-12T18:06:10Z

I would suggest finetune with an MRZ font.


lorenzob · 2018-07-12T19:50:29Z

Thanks, I'll try that too. This time I did fine tuning but I used real images and I do not have many of these.

Do you know of any ready-to-run script/tool that I can use for the font/labels generation? I may even start the training from scratch; do you think that would be better? It's ok to let it run for a couple of days if needed.

I'm also trying to understand if this is a common issue or not, I'm seeing this quite often even with more classic fonts with normal spacing.

Shreeshrii · 2018-07-13T04:08:34Z

> Do you know any ready to run script/tool that I can use for the font/labels generation?

text2image
jtessboxeditor

> I'm also trying to understand if this is a common issue or not, I'm seeing this quite often even with more classic fonts with normal spacing.

Ray's training is done on thousands of fonts (as per his comments in the wiki and GitHub issues). The accuracy on uncommon fonts may be lower than the overall level.

My suggestion will be to finetune, just 300-400 iterations.

Shreeshrii · 2018-07-13T05:01:17Z

It could also be an issue related to the last letter on a line.

lorenzob · 2018-07-13T09:52:08Z

Thanks, I had forgotten about text2image.

I think the number of training iterations depends on the size of the training set. How many lines/pages do you have in mind when you say 300/400 iterations?

Shreeshrii · 2018-07-13T11:54:38Z

Ray's recommendation for finetune for new font (no other changes) for English - training text about 80 lines, iterations 300.
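
A fine-tuning run along those lines might be launched with something like the sketch below. It only assembles an `lstmtraining` command line with the 300-iteration cap; every path in the usage comment is a placeholder, the helper name `build_finetune_command` is invented for this example, and the starting checkpoint is assumed to have been extracted from `eng.traineddata` beforehand (for example with `combine_tessdata -e`).

```python
from typing import List


def build_finetune_command(
    continue_from: str,
    traineddata: str,
    train_listfile: str,
    model_output: str,
    max_iterations: int = 300,
) -> List[str]:
    """Assemble an lstmtraining command that fine-tunes an existing model.

    All paths are supplied by the caller; nothing here is specific to this thread.
    """
    return [
        "lstmtraining",
        "--continue_from", continue_from,    # LSTM checkpoint to start from
        "--traineddata", traineddata,        # traineddata the checkpoint came from
        "--train_listfile", train_listfile,  # text file listing the .lstmf training files
        "--model_output", model_output,      # prefix for the checkpoints that get written
        "--max_iterations", str(max_iterations),
    ]


# Hypothetical usage, with placeholder paths:
# cmd = build_finetune_command("eng.lstm", "eng.traineddata",
#                              "mrz.training_files.txt", "output/mrz")
# subprocess.run(cmd, check=True)
```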

zdenop · 2018-09-30T15:28:24Z

@lorenzob : is the issue solved for you?

Shreeshrii · 2018-10-02T01:19:07Z

The image gives accurate results with eng from tessdata_fast .

bash ./splitletters.sh

 *****  ./splitletters.png LANG eng TESSDATA tessdata OEM 1 PSM 3 ****
17AE4L

 *****  ./splitletters.png LANG eng TESSDATA tessdata OEM 1 PSM 6 ****
17AE4L

 *****  ./splitletters.png LANG eng TESSDATA tessdata_best OEM 1 PSM 3 ****
17AE4L

 *****  ./splitletters.png LANG eng TESSDATA tessdata_best OEM 1 PSM 6 ****
17AE4L

 *****  ./splitletters.png LANG eng TESSDATA tessdata_fast OEM 1 PSM 3 ****
17AE4

 *****  ./splitletters.png LANG eng TESSDATA tessdata_fast OEM 1 PSM 6 ****
17AE4

 *****  ./splitletters.png SCRIPT Latin TESSDATA tessdata_best OEM 1 PSM 3 ****
17ZAE4

 *****  ./splitletters.png SCRIPT Latin TESSDATA tessdata_best OEM 1 PSM 6 ****
17ZAE4

 *****  ./splitletters.png SCRIPT Latin TESSDATA tessdata_fast OEM 1 PSM 3 ****
17AE4

 *****  ./splitletters.png SCRIPT Latin TESSDATA tessdata_fast OEM 1 PSM 6 ****
17AE4
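
The contents of `splitletters.sh` are not shown in this thread. A comparison like the one above could be reproduced with a sketch such as the following, which builds one `tesseract` command per combination of model directory and page segmentation mode. The directory names are assumed to be local checkouts of the three tessdata repositories, and the function names are invented for this example.

```python
import itertools
import subprocess
from typing import List, Sequence


def build_command(image: str, lang: str, tessdata_dir: str, oem: int, psm: int) -> List[str]:
    """Build a tesseract command that prints the recognized text to stdout."""
    return [
        "tesseract", image, "stdout",
        "--tessdata-dir", tessdata_dir,
        "-l", lang,
        "--oem", str(oem),
        "--psm", str(psm),
    ]


def build_matrix(
    image: str,
    langs: Sequence[str] = ("eng",),
    # Assumption: these are paths to local copies of the model repositories.
    tessdata_dirs: Sequence[str] = ("tessdata", "tessdata_best", "tessdata_fast"),
    oem: int = 1,
    psms: Sequence[int] = (3, 6),
) -> List[List[str]]:
    """Return one command per (language, model directory, psm) combination."""
    return [
        build_command(image, lang, directory, oem, psm)
        for lang, directory, psm in itertools.product(langs, tessdata_dirs, psms)
    ]


def run_matrix(commands: List[List[str]]) -> None:
    """Run each command and print its arguments followed by the recognized text."""
    for cmd in commands:
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(" ".join(cmd))
        print(result.stdout.strip())
```

Calling `run_matrix(build_matrix("./splitletters.png"))` would run the six `eng` combinations in the same order as the listing above; adding 7, 8 and 13 to `psms` extends the comparison to the single-line modes.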

Shreeshrii · 2018-10-02T01:21:53Z

Also get correct results with --oem 0 and eng.traineddata from tessdata.


 *****  ./splitletters.png LANG eng TESSDATA tessdata OEM 0 PSM 3 ****
17AE4

 *****  ./splitletters.png LANG eng TESSDATA tessdata OEM 0 PSM 6 ****
17AE4

lorenzob · 2018-10-02T15:27:00Z

I'm using PSM.RAW_LINE, number 13 I think. I remember reading somewhere that it was the best one for a single line of text with tesseract 4 (but maybe it was wrong, outdated or for a different context).
I'll try the other psm modes.

Please give me a few days, my notebook died a couple of days ago and I'm still setting up a new one.

zdenop · 2018-10-13T10:02:23Z

For psm 8, 13 result is - 17AE4L4
For psm 3, 4, 5, 6, 7, 9, 10, 11, 12 result is 17AE4L with tessdata and tessdata_best

lorenzob · 2018-10-13T16:39:31Z

Ok, I tried PSM 7 and 13 on the 4 datasets I'm using, with 4 fine-tuned models. With two of them PSM 7 made no difference at all; with the other two it gave a very small improvement.

I get the same results as zdenop with the two original test images and the eng models.

So it seems like PSM 13 is not a good idea.

Shree and zdenop thanks for your help.

lorenzob · 2018-10-24T20:41:42Z

I discovered why I was using psm=13. This image works perfectly with psm 13 and a custom trained model:

but it returns nothing at all if I use psm 6 or 7. I tried setting different dpi values on the image and adding white borders on the top and bottom to try to convince tesseract that the image is not empty, but it keeps ignoring it.

The only workaround I found is to add the letter "A" in front of the line and later discard it but it is not something I'd like to do as a solution.
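
The discard step of that workaround could look roughly like this. It is a minimal sketch: `strip_sentinel` is a name invented for this example, and it assumes the prepended "A" is recognized as the first character of the output.

```python
from typing import Optional


def strip_sentinel(recognized: str, sentinel: str = "A") -> Optional[str]:
    """Remove the sentinel that was prepended to the line image before OCR.

    Returns None when the output does not start with the sentinel, which
    suggests the sentinel itself was misread and the result should be checked.
    """
    text = recognized.strip()
    if not text.startswith(sentinel):
        return None
    return text[len(sentinel):].lstrip()
```

Returning `None` instead of the raw text makes a misread sentinel visible to the caller rather than silently passing through a possibly shifted result.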

Is there something I can do to keep the benefits of psm 7 and 13?

zdenop added the accuracy label Oct 13, 2018

lorenzob closed this as completed Oct 13, 2018
