Correct handling of TM sign #63

Open
amitdo opened this issue Mar 31, 2017 · 11 comments
Comments

amitdo commented Mar 31, 2017

Copied from #59


[reply to @Shreeshrii]

@theraysmith commented

TM is also difficult, as it is in conflict with the needs of fi/fl, which should not appear in the output.

@Shreeshrii
Contributor

Copied from: issue tesseract-ocr/tesseract#761

https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/JvEF7f0KU8I/La50m7SzEgAJ

The trademark symbol is still not recognized properly. With the newly generated traineddata, the symbol is recognized as the letters TM.

Looking at Latin.unicharset I see that the normalized form of these is just the regular numbers or TM.

@theraysmith Does this need to be changed?

™ 0 63,201,209,255,101,273,0,59,104,293 Common 1496 10 1496 TM # ™ [2122 ]
² 0 3,192,209,255,50,248,0,105,0,293 Common 1090 2 1090 2 # ² [b2 ]
³ 0 0,192,209,255,48,268,0,99,0,293 Common 1091 2 1091 3 # ³ [b3 ]
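The normalized forms quoted above (TM, 2, 3) match what Unicode compatibility normalization (NFKC) produces for these code points, and the same normalization turns the fi/fl ligatures into plain letters, which may be the conflict mentioned earlier in the thread. A minimal Python sketch using only the standard library follows. Whether Tesseract's training tools apply exactly NFKC is an assumption here, not something the thread confirms; `SYMBOLS` and `nfkc_forms` are illustrative names.

```python
import unicodedata

# Code points discussed in this thread, plus the fi/fl ligatures.
# Trade mark sign, superscript two, superscript three, fi ligature, fl ligature.
SYMBOLS = ["\u2122", "\u00b2", "\u00b3", "\ufb01", "\ufb02"]


def nfkc_forms(symbols):
    """Map each symbol to its NFKC (compatibility) normalized form."""
    return {s: unicodedata.normalize("NFKC", s) for s in symbols}


for symbol, normalized in nfkc_forms(SYMBOLS).items():
    print(f"U+{ord(symbol):04X} {symbol!r} -> {normalized!r}")
```

If the sign should survive in the output while the ligatures should not, the two cases cannot be handled by one blanket compatibility normalization, so the sign would need to be treated as an exception.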

theraysmith commented Mar 31, 2017 via email

amitdo commented Apr 15, 2017

#59 (comment)

theraysmith commented

Thanks for opening the new issues 62, 63. I will continue to think about the best approach. I tried to include TM in the current round of training, but it is too infrequent to have made the cut line. I will have to add it to the desired_characters list.

amitdo commented Jul 24, 2017

Shreeshrii commented Jul 25, 2017

@theraysmith

tesseract-ocr/tesseract@b0ead95 does not seem to solve this.
Does it also require your newer language models?

I ran replace-a-layer training with the FreeSerif and FreeSans fonts until the error rate reached 0.01%. However, it still recognizes the TM trademark sign as the letters TM and not as the sign, even when tested with the same tif that was used for training.

A zip file with the training text, synthetic training images, generated traineddata and OCR output with --oem 1 is attached.

eng.englayer.zip

tesseract -v
tesseract b0ead95
 leptonica-1.74.4
  libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.0 : libopenjp2 2.1.2

tesseract eng.FreeSerif.exp0.tif eng.FreeSerif.englayer -l englayer --oem 1 --psm 6 --tessdata-dir ../../tessdata
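For anyone wanting to repeat this test from a script, the invocation above could be wrapped roughly as follows. This is a sketch, not part of the original report: `build_command` and `ocr_text` are hypothetical helper names, and it assumes `tesseract` is on the PATH and writes its text result to `<output_base>.txt`, as it does by default.

```python
import subprocess
from pathlib import Path


def build_command(image, output_base, lang, tessdata_dir, oem=1, psm=6):
    """Assemble the same tesseract invocation as the command quoted above."""
    return [
        "tesseract", str(image), str(output_base),
        "-l", lang,
        "--oem", str(oem),
        "--psm", str(psm),
        "--tessdata-dir", str(tessdata_dir),
    ]


def ocr_text(image, output_base, lang, tessdata_dir):
    """Run tesseract and return the text it wrote to <output_base>.txt."""
    command = build_command(image, output_base, lang, tessdata_dir)
    subprocess.run(command, check=True)  # raises if tesseract exits non-zero
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Calling `ocr_text("eng.FreeSerif.exp0.tif", "eng.FreeSerif.englayer", "englayer", "../../tessdata")` would reproduce the run reported here, given the same files.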

@Shreeshrii
Contributor

I notice that the unicharset still has TM as the normalized form instead of the sign.
Does Latin.unicharset need updating?

™ 0 63,201,209,255,101,273,0,59,104,293 Common 112 10 112 TM	# ™ [2122 ]
· 10 64,148,129,255,13,238,5,125,39,293 Common 113 10 113 ·	# · [b7 ]p
℠ 0 130,152,235,249,167,228,3,30,192,234 Common 114 10 114 SM	# ℠ [2120 ]
℗ 0 13,65,229,255,165,244,0,30,169,273 Common 115 10 115 ℗	# ℗ [2117 ]
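Entries like these can be checked mechanically. The sketch below lists the symbols whose normalized form differs from the symbol itself. It assumes, based only on the lines quoted in this thread, that the normalized form is the last whitespace-separated field before the `#` comment; `normalized_form` and `mismatches` are hypothetical helper names.

```python
def normalized_form(line):
    """Return (symbol, normalized form) for one unicharset entry line.

    Assumption: the normalized form is the last field before the '#' comment.
    """
    tokens = line.split()
    comment_at = tokens.index("#", 1)  # start at 1 in case the symbol itself is '#'
    return tokens[0], tokens[comment_at - 1]


def mismatches(lines):
    """List (symbol, normalized) pairs where the two differ."""
    pairs = (normalized_form(line) for line in lines if line.strip())
    return [(symbol, norm) for symbol, norm in pairs if symbol != norm]


# The four entries quoted above, copied as-is.
ENTRIES = [
    "™ 0 63,201,209,255,101,273,0,59,104,293 Common 112 10 112 TM\t# ™ [2122 ]",
    "· 10 64,148,129,255,13,238,5,125,39,293 Common 113 10 113 ·\t# · [b7 ]p",
    "℠ 0 130,152,235,249,167,228,3,30,192,234 Common 114 10 114 SM\t# ℠ [2120 ]",
    "℗ 0 13,65,229,255,165,244,0,30,169,273 Common 115 10 115 ℗\t# ℗ [2117 ]",
]

print(mismatches(ENTRIES))  # [('™', 'TM'), ('℠', 'SM')]
```

On these four lines only the trade mark and service mark signs are mapped to letters; the middle dot and the sound recording copyright sign keep themselves as their normalized form.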

theraysmith commented Jul 25, 2017 via email

theraysmith commented Jul 25, 2017 via email

@Shreeshrii
Contributor

Ray,
I updated langdata and tesseract and built tesseract again.

With the new traineddata, the TM sign is not being recognized at all; it is getting dropped.

with eng.traineddata

The trademark symbol (*), in Unicode U+2122 *~ trade mark sign (HTML ™ — ™),
\texttrademark in LaTeX,[1] [2] is a symbol used to indicate an assertion that the preceding mark

is a trademark. Registered trademarks are indicated using the registered trademark symbol (®),

with new englayer.traineddata

The trademark symbol (), in Unicode U+2122 " trade mark sign (HTML ™ · ™),
\texttrademark in LaTeX,[1] [2] is a symbol used to indicate an assertion that the preceding mark

is a trademark. Registered trademarks are indicated using the registered trademark symbol (®),

I used the old .lstmf files to do training - would that be a problem?

Shreeshrii commented Jul 28, 2017

@theraysmith

I trained again after creating new box/tiff and lstmf files using the new code and new langdata.

TM sign is now being recognized correctly.

It is also NOT treating fl and fi as ligatures but as separate letters in words such as film, first, flounder, reflect, etc.

Thanks!

eng.FreeSerif.engTM.txt
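The two properties reported in this comment (the sign is present, the fi/fl ligature code points are absent) can be verified on any OCR output file with a few lines of Python. This is only an illustrative check; `check_ocr_text` and the sample string are made up for the example and are not taken from the attached output.

```python
TRADE_MARK = "\u2122"  # trade mark sign
LIGATURES = {"\ufb01": "fi", "\ufb02": "fl"}  # ligature code point -> plain letters


def check_ocr_text(text):
    """Count trade mark signs and list any fi/fl ligature code points found."""
    return {
        "trade_mark_signs": text.count(TRADE_MARK),
        "ligatures": sorted(ch for ch in set(text) if ch in LIGATURES),
    }


# Made-up sample standing in for OCR output.
sample = "Acme\u2122 film, first, flounder, reflect"
print(check_ocr_text(sample))  # {'trade_mark_signs': 1, 'ligatures': []}
```

A non-zero sign count together with an empty ligature list is the outcome described above.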

theraysmith commented Jul 28, 2017 via email
