Correct handling of TM sign #63
Copied from: issue tesseract-ocr/tesseract#761
The trademark symbol is still not recognized properly. With the newly generated traineddata the symbol is recognized as TM.
Looking at Latin.unicharset I see that the normalized form of these is just the regular numbers or TM.
@theraysmith Does this need to be changed?
™ 0 63,201,209,255,101,273,0,59,104,293 Common 1496 10 1496 TM # ™ [2122 ]
|
After thinking about this carefully, I decided to undo a change I had made
for the LSTM engine, and better solve tatweel.
The fi/fl ligatures will no longer be included in unicharsets, but will
still be included in the training text, by replacing them with fi/fl pairs
at the same time that tatweel is deleted.
This allows the output to be un-normalized, shaped quotes to be brought
back, and the TM symbol recognized as a single character.
It doesn't help with the sub/superscript problem, and I have another idea
that I want to try that is more important first...
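
The text handling described above can be sketched roughly as follows. This is only an illustration, not Tesseract's actual code: the function name `normalize_training_text` and the mapping table are hypothetical, and it assumes the ligatures in question are U+FB01 and U+FB02 and that tatweel is U+0640. The last lines show why a blanket NFKC normalization does not fit here: it would also fold the TM sign into the two letters.

```python
import unicodedata

# Hypothetical mapping for illustration; not taken from the Tesseract sources.
LIGATURE_PAIRS = {"\ufb01": "fi", "\ufb02": "fl"}  # U+FB01, U+FB02
TATWEEL = "\u0640"  # Arabic tatweel (kashida)


def normalize_training_text(text: str) -> str:
    """Replace fi/fl ligatures with letter pairs and delete tatweel.

    Everything else, including the TM sign (U+2122) and shaped quotes,
    is left untouched.
    """
    table = {ord(lig): pair for lig, pair in LIGATURE_PAIRS.items()}
    table[ord(TATWEEL)] = None  # None deletes the character in str.translate
    return text.translate(table)


if __name__ == "__main__":
    sample = "\ufb01lm re\ufb02ect\u2122"
    print(normalize_training_text(sample))
    # By contrast, NFKC also rewrites the TM sign as the letters T and M.
    print(unicodedata.normalize("NFKC", sample))
```

With a targeted table like this, only the characters named above change, so the TM sign can stay a single character in the ground truth.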
On Fri, Mar 31, 2017 at 2:10 AM, Shreeshrii wrote:
Copied from: issue tesseract-ocr/tesseract#761
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/JvEF7f0KU8I/La50m7SzEgAJ
the trademark symbol is still not recognized properly. With the newly
generated traineddata the symbol is recognized as TM
Looking at Latin.unicharset I see that the normalized form of these is
just the regular numbers or TM.
@theraysmith <https://github.com/theraysmith> Does this need to be
changed?
™ 0 63,201,209,255,101,273,0,59,104,293 Common 1496 10 1496 TM # ™ [2122 ]
² 0 3,192,209,255,50,248,0,105,0,293 Common 1090 2 1090 2 # ² [b2 ]
³ 0 0,192,209,255,48,268,0,99,0,293 Common 1091 2 1091 3 # ³ [b3 ]
--
Ray.
|
theraysmith commented
|
tesseract-ocr/tesseract@b0ead95 does not seem to solve this. I ran replace-a-layer training with the fonts FreeSerif and FreeSans until the error rate reached 0.01%. However, it still recognizes the TM trademark sign as the letters TM and not as the sign, even when testing with the same tif that was used for training. A zip file with the training text, synthetic training images, generated traineddata and OCR output with --oem 1 is attached.
|
I notice that the unicharset still has TM as normalized version instead of sign.
|
No there are still one or two commits to go before that will work. I might
get them in today.
On Tue, Jul 25, 2017 at 5:14 AM, Shreeshrii wrote:
I notice that the unicharset still has TM as normalized version instead of
sign.
Does latin.unicharset need updating?
™ 0 63,201,209,255,101,273,0,59,104,293 Common 112 10 112 TM # ™ [2122 ]
· 10 64,148,129,255,13,238,5,125,39,293 Common 113 10 113 · # · [b7 ]p
℠ 0 130,152,235,249,167,228,3,30,192,234 Common 114 10 114 SM # ℠ [2120 ]
℗ 0 13,65,229,255,165,244,0,30,169,273 Common 115 10 115 ℗ # ℗ [2117 ]
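
To check which entries carry a normalized form that differs from the character itself, lines like the ones above can be split into fields. This is a minimal sketch, assuming the column order described in the unicharset(5) man page (character, properties, glyph metrics, script, other case, direction, mirror, normalized form) and a trailing `#` comment; `UnicharEntry` and `parse_unicharset_line` are hypothetical names, not part of Tesseract.

```python
from typing import NamedTuple, Tuple


class UnicharEntry(NamedTuple):
    unichar: str
    properties: str            # kept as text; not interpreted here
    metrics: Tuple[int, ...]   # the 10 comma-separated glyph metrics
    script: str
    other_case: int
    direction: int
    mirror: int
    normed: str                # normalized form, e.g. "TM" for the TM sign


def parse_unicharset_line(line: str) -> UnicharEntry:
    """Parse one full-format unicharset line (assumed 8 fields + comment)."""
    body = line.split(" # ", 1)[0]  # drop the trailing "# ..." comment
    fields = body.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}: {line!r}")
    unichar, props, metrics, script, other, direction, mirror, normed = fields
    return UnicharEntry(
        unichar, props, tuple(int(m) for m in metrics.split(",")),
        script, int(other), int(direction), int(mirror), normed,
    )


if __name__ == "__main__":
    lines = [
        "™ 0 63,201,209,255,101,273,0,59,104,293 Common 112 10 112 TM # ™ [2122 ]",
        "℗ 0 13,65,229,255,165,244,0,30,169,273 Common 115 10 115 ℗ # ℗ [2117 ]",
    ]
    for entry in map(parse_unicharset_line, lines):
        status = "self" if entry.normed == entry.unichar else "differs"
        print(entry.unichar, "->", entry.normed, f"({status})")
```

On the lines quoted here, the TM and SM entries normalize to letter pairs while the ℗ entry normalizes to itself, which is the difference being asked about.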
--
Ray.
|
Right, try it now.
You need commits b0ead95d..0e95e2ca and 1a0f501..3e32be3 (in langdata).
I think they are everything you need.
The new English model will contain TM.
On Tue, Jul 25, 2017 at 8:29 AM, Ray Smith wrote:
No there are still one or two commits to go before that will work. I might
get them in today.
--
Ray.
|
Ray,
With the new traineddata, TM is not being recognized at all - it is getting dropped.
with eng.traineddata
with new englayer.traineddata
I used the old .lstmf files to do training - would that be a problem?
|
I trained again after creating new box/tiff and lstmf files using the new code and new langdata.
TM sign is now being recognized correctly.
It is also NOT treating fl and fi as ligatures but as separate letters in words such as film, first, flounder, reflect etc.
Thanks!
|
Great!
That is the objective with fi and fl ligatures. They now have similar
status as tatweel: used for rendering, but not for output, except of course
that fi and fl produce output characters, but tatweel disappears completely.
On Thu, Jul 27, 2017 at 8:31 PM, Shreeshrii wrote:
@theraysmith <https://github.com/theraysmith>
I trained again after creating new box/tiff and lstmf files using the new
code and new langdata.
TM sign is now being recognized correctly.
It is also NOT treating *fl* and *fi* as glyphs but as separate letters
in words such as *film, first, flounder, reflect* etc.
Thanks!
--
Ray.
|
Copied from 59
[reply to @Shreeshrii]
@theraysmith commented
TM is also difficult, as it is in conflict with the needs of fi/fl, which should not appear in the output.