Letters split in multiple parts #1778

Closed
lorenzob opened this issue Jul 12, 2018 · 13 comments

@lorenzob

Hi,
I have a small problem with some letters that are recognized as multiple letters.

tesseract 4.0.0-beta.3-56-g5fda
leptonica-1.76.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found SSE

This is a sample (I can reproduce the problem with this image and eng "_best", md5: 4be3f51b55c0074d8c6b1ee5b5100f95):

[Image: split2]

output is: 17AE4L4

The 4 is seen as three different letters. Maybe the shape of the 4 is not so common and this is creating the problem. This is from an MRZ code, but I'm seeing this in normal text too.

This is how tesseract sees the image (data is taken from the bounding boxes):

[Image: ocr_boxes_split2]

(The cyan and magenta lines on top and bottom are there just to make the switch between letters easier to see; the colors have no special meaning.)

I'm wondering if there is anything I can do to fix this other than training a custom model on this font (it is part of an mrz, btw).

Even a small edit to the image, like cropping, makes the problem appear or disappear. The output for this other sample is : 17AESL

[Image: split1]
[Image: ocr_boxes_split1]

Are there any parameters like minimum box size, split threshold, something I can ask the iterator, etc. that might help? Or is everything part of the lstm?
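One way to detect this kind of split after the fact, without touching the model, is to look at the widths of the reported character boxes. The following is a minimal sketch, assuming the boxes are available in Tesseract's box-file format (`char left bottom right top page`). The function names, the 0.5 ratio, and the sample coordinates are made up for illustration; the heuristic only makes sense for roughly monospaced text such as an MRZ line.

```python
from statistics import median

# Made-up sample in box-file format: "<char> <left> <bottom> <right> <top> <page>".
# It imitates a trailing "4" that was reported as three narrow boxes.
SAMPLE_BOXES = """\
1 10 5 30 45 0
7 34 5 54 45 0
A 58 5 78 45 0
E 82 5 102 45 0
4 106 5 113 45 0
L 113 5 119 45 0
4 119 5 126 45 0
"""


def parse_boxes(box_text):
    """Parse box-format lines into (char, left, bottom, right, top) tuples."""
    boxes = []
    for line in box_text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        left, bottom, right, top = map(int, parts[1:5])
        boxes.append((parts[0], left, bottom, right, top))
    return boxes


def flag_narrow_boxes(boxes, ratio=0.5):
    """Return the boxes whose width is below ratio * median box width."""
    widths = [right - left for _, left, _, right, _ in boxes]
    if not widths:
        return []
    threshold = ratio * median(widths)
    return [box for box, width in zip(boxes, widths) if width < threshold]


if __name__ == "__main__":
    for box in flag_narrow_boxes(parse_boxes(SAMPLE_BOXES)):
        print("suspiciously narrow:", box)
```

A run of adjacent flagged boxes is a hint that one glyph was split, so the line could be retried with a different crop or page segmentation mode.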

In general it seems to be very sensitive to horizontal translation; I'm seeing this, less often, even with fine-tuned models.

I tried some data augmentation, varying the amount of white space on the left from 0 to 5 px, but got confusing results (too little data to be sure whether it was good or bad). Do you think this may be a valid strategy?
Do you think training from scratch may help?

Thanks for any advice.
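The left-margin augmentation described above can be sketched as follows. This is only an illustration of the idea: it represents a grayscale image as a list of pixel rows so that it stays self-contained, whereas a real pipeline would use an image library. The function names are hypothetical, and the sketch says nothing about whether the strategy actually improves recognition.

```python
def pad_left(image, pad, white=255):
    """Return a copy of `image` (a list of pixel rows) with `pad` white columns on the left."""
    return [[white] * pad + list(row) for row in image]


def left_padding_variants(image, max_pad=5):
    """Yield (pad, padded_image) for every pad width from 0 to max_pad inclusive."""
    for pad in range(max_pad + 1):
        yield pad, pad_left(image, pad)


if __name__ == "__main__":
    tiny = [[0, 255], [255, 0]]  # 2x2 toy image, 0 = black, 255 = white
    for pad, variant in left_padding_variants(tiny):
        print(pad, variant[0])
```

Each variant keeps the same ground-truth text, so one labeled line image yields six training samples that differ only in horizontal offset.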

@Shreeshrii
Collaborator

Shreeshrii commented Jul 12, 2018 via email

@lorenzob
Author

Thanks, I'll try that too. This time I did fine tuning but I used real images and I do not have many of these.

Do you know any ready to run script/tool that I can use for the font/labels generation? I may even start the training from scratch; do you think that may be better? It's OK to let it run for a couple of days if needed.

I'm also trying to understand if this is a common issue or not, I'm seeing this quite often even with more classic fonts with normal spacing.

@Shreeshrii
Collaborator

> Do you know any ready to run script/tool that I can use for the font/labels generation?

text2image
jtessboxeditor
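For reference, a text2image invocation can be assembled as shown below. This is a hedged sketch: `--text`, `--outputbase`, `--font` and `--fonts_dir` are text2image options, but the file names, the font name and the fonts directory are placeholders that are not taken from this thread. The command is only printed, not executed.

```python
import shlex


def text2image_command(text_file, output_base, font, fonts_dir):
    """Build the argument list for rendering a training text with one font."""
    return [
        "text2image",
        f"--text={text_file}",          # UTF-8 training text, one line per text line
        f"--outputbase={output_base}",  # writes <output_base>.tif and <output_base>.box
        f"--font={font}",               # font name as fontconfig knows it (placeholder here)
        f"--fonts_dir={fonts_dir}",     # directory that contains the font files
    ]


if __name__ == "__main__":
    cmd = text2image_command(
        "mrz_lines.txt", "eng.ocrb.exp0", "OCR-B", "/usr/share/fonts"
    )
    print(shlex.join(cmd))
```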

> I'm also trying to understand if this is a common issue or not, I'm seeing this quite often even with more classic fonts with normal spacing.

Ray's training is done on thousands of fonts (as per his comments in the wiki and GitHub issues). The accuracy on uncommon fonts may be lower than the overall level.

My suggestion would be to fine-tune, for just 300-400 iterations.

@Shreeshrii
Collaborator

It could also be an issue related to the last letter on a line.

@lorenzob
Author

Thanks, I had forgotten about text2image.

I think the number of training iterations depends on the size of the training set. How many lines/pages do you have in mind when you say 300/400 iterations?

@Shreeshrii
Collaborator

Ray's recommendation for fine-tuning English for a new font (no other changes): about 80 lines of training text and 300 iterations.
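A short fine-tuning run of that size could be set up roughly as follows. This is a sketch under assumptions: the flags are the ones used in the Tesseract 4 training documentation (`combine_tessdata -e`, and `lstmtraining` with `--continue_from`, `--traineddata`, `--train_listfile`, `--model_output`, `--max_iterations`), but every path below is a placeholder, and the `.lstmf` training files listed in the list file are assumed to exist already. The commands are only built and printed.

```python
import shlex


def finetune_commands(base_traineddata, lstm_path, output_prefix, listfile,
                      max_iterations=300):
    """Return (extract_cmd, train_cmd) for a short fine-tuning run."""
    # Step 1: extract the LSTM network from the base traineddata file.
    extract = ["combine_tessdata", "-e", base_traineddata, lstm_path]
    # Step 2: continue training from that network on the new font's lines.
    train = [
        "lstmtraining",
        "--model_output", output_prefix,
        "--continue_from", lstm_path,
        "--traineddata", base_traineddata,
        "--train_listfile", listfile,
        "--max_iterations", str(max_iterations),
    ]
    return extract, train


if __name__ == "__main__":
    # All paths are hypothetical.
    for cmd in finetune_commands(
        "tessdata_best/eng.traineddata",
        "work/eng.lstm",
        "work/newfont",
        "work/eng.training_files.txt",
    ):
        print(shlex.join(cmd))
```

Fine-tuning has to start from a float model such as the ones in tessdata_best; the integer models in tessdata_fast cannot be used as a starting point.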

@zdenop
Contributor

zdenop commented Sep 30, 2018

@lorenzob : is the issue solved for you?

@Shreeshrii
Collaborator

The image gives accurate results with eng from tessdata_fast.

bash ./splitletters.sh

 *****  ./splitletters.png LANG eng TESSDATA tessdata OEM 1 PSM 3 ****
17AE4L

 *****  ./splitletters.png LANG eng TESSDATA tessdata OEM 1 PSM 6 ****
17AE4L

 *****  ./splitletters.png LANG eng TESSDATA tessdata_best OEM 1 PSM 3 ****
17AE4L

 *****  ./splitletters.png LANG eng TESSDATA tessdata_best OEM 1 PSM 6 ****
17AE4L

 *****  ./splitletters.png LANG eng TESSDATA tessdata_fast OEM 1 PSM 3 ****
17AE4

 *****  ./splitletters.png LANG eng TESSDATA tessdata_fast OEM 1 PSM 6 ****
17AE4

 *****  ./splitletters.png SCRIPT Latin TESSDATA tessdata_best OEM 1 PSM 3 ****
17ZAE4

 *****  ./splitletters.png SCRIPT Latin TESSDATA tessdata_best OEM 1 PSM 6 ****
17ZAE4

 *****  ./splitletters.png SCRIPT Latin TESSDATA tessdata_fast OEM 1 PSM 3 ****
17AE4

 *****  ./splitletters.png SCRIPT Latin TESSDATA tessdata_fast OEM 1 PSM 6 ****
17AE4

@Shreeshrii
Collaborator

I also get correct results with --oem 0 and eng.traineddata from tessdata.


 *****  ./splitletters.png LANG eng TESSDATA tessdata OEM 0 PSM 3 ****
17AE4

 *****  ./splitletters.png LANG eng TESSDATA tessdata OEM 0 PSM 6 ****
17AE4

@lorenzob
Author

lorenzob commented Oct 2, 2018

I'm using PSM.RAW_LINE, number 13 I think. I remember reading somewhere that it was the best one for a single line of text with tesseract 4 (but maybe it was wrong, outdated or for a different context).
I'll try the other psm modes.

Please give me a few days, my notebook died a couple of days ago and I'm still setting up a new one.

@zdenop
Contributor

zdenop commented Oct 13, 2018

For psm 8 and 13 the result is 17AE4L4.
For psm 3, 4, 5, 6, 7, 9, 10, 11 and 12 the result is 17AE4L, with both tessdata and tessdata_best.
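The comparison across model directories and page segmentation modes reported in this thread can be reproduced with a loop like the one below. It is a sketch rather than the script used above: the `tesseract` options (`--tessdata-dir`, `-l`, `--oem`, `--psm`, and `stdout` as the output base) are real, while the function name and the directory names are assumptions about the local setup. The commands are only generated here.

```python
from itertools import product


def tesseract_commands(image, tessdata_dirs, langs, oems, psms):
    """Build one tesseract command per (tessdata dir, language, OEM, PSM) combination."""
    commands = []
    for data_dir, lang, oem, psm in product(tessdata_dirs, langs, oems, psms):
        commands.append([
            "tesseract", image, "stdout",   # print recognized text to stdout
            "--tessdata-dir", data_dir,
            "-l", lang,
            "--oem", str(oem),
            "--psm", str(psm),
        ])
    return commands


if __name__ == "__main__":
    cmds = tesseract_commands(
        "splitletters.png",
        ["tessdata", "tessdata_best", "tessdata_fast"],  # local checkouts, assumed
        ["eng"],
        [1],             # 1 = LSTM engine
        [3, 6, 7, 13],   # a few of the modes discussed in this thread
    )
    for cmd in cmds:
        print(" ".join(cmd))
```

Each argument list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` to collect the output per configuration.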

@lorenzob
Author

OK, I tried PSM 7 and 13 on the 4 datasets I'm using, with 4 fine-tuned models. With two of them PSM 7 made no difference at all; with the other two it gave a very small improvement.

I get the same results as zdenop with the two original test images and the eng models.

So it seems like PSM 13 is not a good idea.

Shree and zdenop thanks for your help.

@lorenzob
Author

lorenzob commented Oct 24, 2018

I discovered why I was using psm=13. This image works perfectly with psm 13 and a custom trained model:

[Image: line]

but it returns nothing at all if I use psm 6 or 7. I tried setting different DPI values on the image and adding white borders on top and bottom to convince tesseract that it is not empty, but it keeps ignoring it.

The only workaround I found is to add the letter "A" in front of the line and later discard it but it is not something I'd like to do as a solution.
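For completeness, the discard step of that workaround can be written so that it also reports whether the sentinel letter was actually recognized. This is a minimal sketch with a hypothetical helper name; it covers only the post-processing of the OCR text, not the step that draws the extra letter onto the image.

```python
def strip_sentinel(text, sentinel="A"):
    """Remove a leading sentinel from OCR output.

    Returns (cleaned_text, found). `found` is False when the output does not
    start with the sentinel, which suggests the line was misread or dropped.
    """
    stripped = text.strip()
    if stripped.startswith(sentinel):
        return stripped[len(sentinel):].lstrip(), True
    return stripped, False


if __name__ == "__main__":
    print(strip_sentinel("A17AE4L\n"))
    print(strip_sentinel(""))
```

Checking the `found` flag avoids silently accepting a line where the prepended letter itself was lost.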

Is there something I can do to keep the benefits of psm 7 and 13?
