tesseract does not recognize letters of good quality #3858

Open
Golddouble opened this issue Jul 4, 2022 · 4 comments

@Golddouble

Golddouble commented Jul 4, 2022

Environment

  • Tesseract Version: 4.0.0-2
  • Platform: Linux mx 4.19.0-20-amd64 SMP Debian 4.19.235-1 (2022-03-17) x86_64 GNU/Linux

My observation is about the following image:
1_0

Current Behavior:

In this region (image: k20220704-211236), Tesseract does not recognize the following letters (image: k20220704-211337).

Expected Behavior:

I had expected Tesseract to recognize these letters, because their quality is quite good.
I am surprised that it does not recognize them. What am I missing?
Do you have an explanation for this behaviour?

I would be very interested in an answer.

Thank you.

@stweil
Contributor

stweil commented Jul 4, 2022

Either preprocess the image before running Tesseract (the latest version, not a rather old one), or use different software.

Tesseract works best with black (or at least grey) letters on white background. It can also handle inverted line images with white letters on black background. The current code does not handle lines with a mix of normal and inverted text.
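To illustrate the kind of preprocessing meant here, the following is a minimal sketch that brings inverted regions back to dark-on-light before OCR. It is not part of Tesseract: the function name `normalize_polarity`, the `tile_height` parameter and the fixed cutoff of 128 are assumptions made for this example, and it only handles horizontal bands, not a single line that mixes both polarities.

```python
from statistics import median


def normalize_polarity(gray, tile_height=2):
    """Return a copy of `gray` with dark-background bands inverted.

    `gray` is a grayscale image given as a list of rows of 0-255 ints.
    The image is processed in horizontal bands of `tile_height` rows.
    (Hypothetical helper for illustration, not a Tesseract API.)
    """
    out = []
    for top in range(0, len(gray), tile_height):
        band = gray[top:top + tile_height]
        pixels = [p for row in band for p in row]
        # Assumption: background pixels are the majority in a band,
        # so a low median brightness means a dark background.
        dark_background = median(pixels) < 128
        for row in band:
            if dark_background:
                out.append([255 - p for p in row])  # flip to dark-on-light
            else:
                out.append(list(row))  # already dark-on-light, copy as-is
    return out


# Top band: dark marks on white. Bottom band: white marks on black.
page = [
    [255, 255, 0, 255],
    [255, 0, 255, 255],
    [0, 0, 255, 0],
    [0, 255, 0, 0],
]
normalized = normalize_polarity(page)
```

A real image would need band boundaries that follow the coloured boxes, which is exactly the layout detection problem described below.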

In general, algorithmic layout and line detection is difficult, and Tesseract is far from perfect on complex layouts. Other modern software uses trained neural networks for layout detection and might give better results. Tesseract still lacks that, mainly because of a lack of developer resources.

@stweil
Contributor

stweil commented Jul 5, 2022

Some of the problems reported here might be fixed with draft pull request #3857:

$ tesseract https://user-images.githubusercontent.com/13977359/177207586-20a5f497-8ece-486b-ae0d-2d515fe7a226.jpeg - -l tessdata_fast/script/Latin -c preprocess_graynorm_mode=1
Estimating resolution as 361
1
Verschiedene Freizeitaktivitäten kennen und beschreiben

----- > DW: Die Schrifterkennung muss verbessert werden: hochgestellte Zahlen werden nicht erkannt.
Text in farbigen Kästen weren nicht erkannt. Fette Schrift wird nicht erkannt.

Les loisirs: stress ou détente?

D D 3 A |
m =A
Èo E / LF
® E S
5 i z ; ARS
a Z LS
samedi, 9 h
TD g
z ;
$.
>g Z
ok So
D
E
D
L72]

səınəy g, 'Ipəðwes

2
3
w
m)
(m)
m
p
w
o
(g)
cC
D
(a
2
3] a faire une promenade avec le chien
E. ` `
2 b aller à la bourse des timbres
1 c lire le journal du dimanche
(e) P
z d rencontrer des copains
S| e écrire une lettre
Lea]
A f aller en boîte
3 g bricoler
5 r
a h jouer aux cartes
i] . . `
a i faire un tour dans le quartier
(©) . PE .
7 j aller au cinéma avec ses copains
A k garder les enfants de la voisine
K2] ‘i `
| écouter de la musique
m jouer du piano
n regarder une vidéo
o jouer de la trompette
E p aller au match de foot
e q jouer au tennis avec son père
3 r faire les courses avec papa
a s montrer les timbres à sa mère
t rester au lit jusqu'à 10 heures
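For anyone who wants to script the same call, here is a minimal sketch that assembles the command line shown above. The helper names `build_tesseract_command` and `run_ocr` are made up for this example. Note that `preprocess_graynorm_mode` comes from the draft pull request, so it is an assumption that a given Tesseract build accepts it.

```python
import subprocess


def build_tesseract_command(image, lang="tessdata_fast/script/Latin", graynorm_mode=1):
    """Build the argument list for the command shown in this comment.

    `image` is a path or URL; "-" sends the recognized text to stdout.
    (Hypothetical helper; the config variable is from the draft PR and
    may not exist in a released Tesseract build.)
    """
    return [
        "tesseract", image, "-",
        "-l", lang,
        "-c", f"preprocess_graynorm_mode={graynorm_mode}",
    ]


def run_ocr(image, **kwargs):
    """Run Tesseract and return its text output (requires tesseract on PATH)."""
    result = subprocess.run(
        build_tesseract_command(image, **kwargs),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


command = build_tesseract_command("page.jpeg")
```

Keeping the argument list separate from the call makes it easy to compare runs with and without the `-c` setting.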

@Golddouble
Author

Golddouble commented Jul 5, 2022

Is this with Tess a er e act 5?

@stweil
Contributor

stweil commented Jul 5, 2022

No, it is with Tesseract 5. 😄
