Letters split in multiple parts #1778
Comments
I would suggest fine-tuning with an MRZ font.
Thanks, I'll try that too. This time I did fine-tuning, but I used real images and I don't have many of these. Do you know any ready-to-run script/tool that I can use for the font/label generation? I may even start the training from scratch; do you think that may be better? It's OK to let it run for a couple of days if needed. I'm also trying to understand whether this is a common issue or not; I'm seeing it quite often even with more classic fonts with normal spacing.
text2image
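For generating synthetic training lines, an invocation along these lines might work. This is a sketch only: the font name, paths, and training-text filename are placeholders, not taken from the thread.

```shell
# First check which fonts text2image can actually use:
text2image --fonts_dir=/usr/share/fonts --list_available_fonts

# Render a training text into a TIFF/box pair for one font
# (assuming a font named "OCRB" is installed; substitute your own):
text2image \
  --text=mrz_training_text.txt \
  --outputbase=eng.OCRB.exp0 \
  --font='OCRB' \
  --fonts_dir=/usr/share/fonts
```

The generated .tif/.box pair can then be fed into the usual training pipeline.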
Ray's training is done on thousands of fonts (per his comments in the wiki and GitHub issues). The accuracy on uncommon fonts may be lower than the overall level. My suggestion would be to fine-tune, for just 300-400 iterations.
It could also be an issue related to the last letter on a line.
Thanks, I had forgotten about text2image. I think the number of training iterations depends on the size of the training set. How many lines/pages do you have in mind when you say 300-400 iterations?
Ray's recommendation for fine-tuning on a new font (with no other changes) for English: a training text of about 80 lines, 300 iterations.
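That fine-tuning recipe can be sketched roughly as follows. All filenames and paths here are illustrative; the .lstm model is first extracted from a tessdata_best traineddata file with combine_tessdata.

```shell
# Extract the LSTM model from the best traineddata (path assumed):
combine_tessdata -e tessdata_best/eng.traineddata eng.lstm

# Fine-tune for ~300 iterations on the new font's .lstmf files:
lstmtraining \
  --model_output output/mrz \
  --continue_from eng.lstm \
  --traineddata tessdata_best/eng.traineddata \
  --train_listfile train/eng.training_files.txt \
  --max_iterations 300

# Package the final checkpoint back into a traineddata file:
lstmtraining \
  --stop_training \
  --continue_from output/mrz_checkpoint \
  --traineddata tessdata_best/eng.traineddata \
  --model_output mrz.traineddata
```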
@lorenzob: is the issue solved for you?
The image gives accurate results with eng from tessdata_fast.
I also get correct results with --oem 0 and eng.traineddata from tessdata.
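For reference, the legacy-engine run can be reproduced with something like the following (the image filename is assumed from the samples in this thread):

```shell
# --oem 0 selects the legacy (non-LSTM) engine; it requires the
# eng.traineddata from the 'tessdata' repo, which includes the legacy model:
tesseract split2.png stdout --oem 0 -l eng
```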
I'm using PSM.RAW_LINE, number 13 I think. I remember reading somewhere that it was the best one for a single line of text with tesseract 4 (but maybe that was wrong, outdated, or for a different context). Please give me a few days; my notebook died a couple of days ago and I'm still setting up a new one.
For psm 8 and 13, the result is:
OK, I tried PSM 7 and 13 on the four datasets I'm using with four fine-tuned models. With two of them PSM 7 made no difference at all; with the other two it gave a very small improvement. I get the same results as zdenop with the two original test images and the eng models. So it seems like PSM 13 is not a good idea. Shree and zdenop, thanks for your help.
I discovered why I was using psm 13. This image works perfectly with psm 13 and a custom-trained model, but it returns nothing at all if I use psm 6 or 7. I tried setting a different DPI on the image and adding white borders on top and bottom to try to convince tesseract that it is not empty, but it keeps ignoring it. The only workaround I found is to add the letter "A" in front of the line and discard it later, but that is not something I'd like to keep as a solution. Is there something I can do to keep the benefits of psm 7 and 13?
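As an aside, the white-border workaround described above can be applied programmatically before handing the image to tesseract. A minimal sketch, using nested lists as a stand-in for a grayscale image (a real pipeline would do the same with Pillow or OpenCV):

```python
def pad_vertical(image, rows, white=255):
    """Add `rows` blank rows above and below a grayscale image,
    where `image` is a list of equal-length rows of pixel values."""
    width = len(image[0])
    blank = [[white] * width for _ in range(rows)]
    # Copy the original rows so the input is not mutated.
    return blank + [row[:] for row in image] + blank

# A tiny 2x3 "image", padded with 2 white rows on each side:
img = [[0, 0, 0],
       [0, 0, 0]]
padded = pad_vertical(img, 2)
print(len(padded), len(padded[0]))  # 6 3
```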
Hi,
I have a small problem with some letters that are recognized as multiple letters.
tesseract 4.0.0-beta.3-56-g5fda
leptonica-1.76.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found SSE
This is a sample (I can reproduce the problem with this image and eng "_best", md5: 4be3f51b55c0074d8c6b1ee5b5100f95):
[image: split2] https://user-images.githubusercontent.com/85481/42649738-c37325f6-860a-11e8-9771-5cd78210bbaa.png
The output is: 17AE4L4
The 4 is seen as three different letters. Maybe the shape of the 4 is not so common and that is creating the problem. This is from an MRZ code, but I'm seeing this in normal text too.
This is how tesseract sees the image (the data is taken from the bounding boxes):
[image: ocr_boxes_split2] https://user-images.githubusercontent.com/85481/42649773-db00f41e-860a-11e8-9053-b79041535929.png
(The cyan and magenta lines on top/bottom are there just to better visualize the letter switches; the colors have no special meaning.)
I'm wondering if there is anything I can do to fix this other than training a custom model on this font (it is part of an MRZ, by the way).
Even a small edit to the image, like cropping, makes the problem appear or disappear. The output for this other sample is: 17AESL
[image: split1] https://user-images.githubusercontent.com/85481/42649750-ccb9cdd6-860a-11e8-9283-b827db17be0c.png
[image: ocr_boxes_split1] https://user-images.githubusercontent.com/85481/42649785-e0d56910-860a-11e8-8255-12f572ef1a35.png
Are there any parameters (minimum box size, split threshold, something I can ask the iterator, etc.) that might help? Or is everything part of the LSTM?
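For inspecting what the engine produces, the per-symbol boxes and confidences can be dumped from the command line; the image filename below is assumed from the samples in this thread.

```shell
# TSV output includes bounding boxes (left, top, width, height) and a
# confidence value for each recognized element:
tesseract split2.png stdout --psm 13 tsv

# Character-level box file via the "makebox" config
# (writes split2.box next to the image):
tesseract split2.png split2 --psm 13 makebox
```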
In general it seems to be very sensitive to horizontal translation; I'm seeing this, less often, even with fine-tuned models.
I tried to do some data augmentation, varying the amount of white space on the left from 0 to 5 px, but got confusing results (too little data to be sure whether it was good or bad). Do you think this may be a valid strategy?
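The left-padding augmentation described above can be sketched like this, again with nested lists standing in for grayscale image rows (a real setup would operate on image files):

```python
def left_pad_variants(image, max_px=5, white=255):
    """Yield copies of a grayscale image (list of pixel rows) with
    0..max_px columns of white pixels added on the left."""
    for n in range(max_px + 1):
        yield [[white] * n + row[:] for row in image]

# A tiny 2x2 "image" expanded into 6 shifted variants (0..5 px):
img = [[0, 0], [0, 0]]
variants = list(left_pad_variants(img, max_px=5))
print(len(variants))        # 6
print(len(variants[5][0]))  # 7 (2 original columns + 5 padding)
```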
Do you think training from scratch may help?
Thanks for any advice.