Correct handling of TM sign #63

amitdo · 2017-03-31T09:03:35Z

Copied from #59

TM is also difficult, as it conflicts with the needs of the fi/fl ligatures, which should not appear in the output.

Shreeshrii · 2017-03-31T09:10:40Z

Copied from: issue tesseract-ocr/tesseract#761

https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/JvEF7f0KU8I/La50m7SzEgAJ

The trademark symbol is still not recognized properly. With the newly generated traineddata, the symbol is recognized as the letters TM.

Looking at Latin.unicharset, I see that the normalized form of these characters is just the regular digits or TM.

@theraysmith Does this need to be changed?

™ 0 63,201,209,255,101,273,0,59,104,293 Common 1496 10 1496 TM # ™ [2122 ]
² 0 3,192,209,255,50,248,0,105,0,293 Common 1090 2 1090 2 # ² [b2 ]
³ 0 0,192,209,255,48,268,0,99,0,293 Common 1091 2 1091 3 # ³ [b3 ]
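
To make the entries above easier to read, here is a minimal Python sketch that pulls the normalized form out of a unicharset line. It assumes the field order unichar, properties, glyph metrics, script, other_case, direction, mirror, normed form, which matches the lines quoted here. `UnicharEntry` and `parse_unicharset_line` are hypothetical helper names, not part of Tesseract.

```python
from typing import NamedTuple


class UnicharEntry(NamedTuple):
    unichar: str     # the character as it appears in the image
    properties: str  # property flags field
    script: str      # script name, e.g. "Common"
    normed: str      # normalized form used by the recognizer


def parse_unicharset_line(line: str) -> UnicharEntry:
    """Parse one unicharset entry (assumed field order, see lead-in)."""
    fields = line.split()
    # Only the first 8 whitespace-separated fields are data; anything
    # after them (the "# ..." part) is a human-readable comment.
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 fields, got {len(fields)}")
    return UnicharEntry(fields[0], fields[1], fields[3], fields[7])


def is_rewritten(entry: UnicharEntry) -> bool:
    """True when the normalized form differs from the character itself."""
    return entry.normed != entry.unichar


if __name__ == "__main__":
    sample = "™ 0 63,201,209,255,101,273,0,59,104,293 Common 1496 10 1496 TM # ™ [2122 ]"
    entry = parse_unicharset_line(sample)
    print(entry.unichar, "->", entry.normed, "rewritten:", is_rewritten(entry))
```

Any entry for which `is_rewritten` is true is one whose normalized form replaces the original character, which is the behavior being questioned here.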

theraysmith · 2017-03-31T22:42:12Z

After thinking about this carefully, I decided to undo a change I had made for the LSTM engine, and better solve tatweel. The fi/fl ligatures will no longer be included in unicharsets, but will still be included in the training text, by replacing them with fi/fl pairs at the same time that tatweel is deleted. This allows the output to be un-normalized, shaped quotes to be brought back, and the TM symbol to be recognized as a single character. It doesn't help with the sub/superscript problem, and I have another idea that I want to try that is more important first...
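
The text-side substitution described above can be sketched roughly as follows. This is only an illustration in Python, not the actual Tesseract training code: it assumes the ligatures in question are U+FB01 and U+FB02 and that tatweel is U+0640, and `normalize_training_text` is a hypothetical name.

```python
# Assumed code points: fi ligature, fl ligature, Arabic tatweel.
FI_LIGATURE = "\ufb01"
FL_LIGATURE = "\ufb02"
TATWEEL = "\u0640"

# One translation table: ligatures expand to letter pairs,
# tatweel maps to None and is therefore deleted.
_TABLE = str.maketrans({
    FI_LIGATURE: "fi",
    FL_LIGATURE: "fl",
    TATWEEL: None,
})


def normalize_training_text(text: str) -> str:
    """Return the target transcription for a line of training text.

    The rendered image may still contain ligatures or tatweel; only the
    expected output text is changed. Other characters, including the
    trade mark sign U+2122, pass through untouched.
    """
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(normalize_training_text("\ufb01lm and re\ufb02ect\u2122"))
```

The point of the design is that the trade mark sign is left alone, so it can be learned as a single output character, while the ligatures never appear in the output.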

amitdo · 2017-04-15T10:06:31Z

#59 (comment)

theraysmith commented

Thanks for opening the new issues 62, 63. I will continue to think about the best approach. I tried to include TM in the current round of training, but it is too infrequent to have made the cut line. I will have to add it to the desired_characters list.

amitdo · 2017-07-24T18:57:11Z

tesseract-ocr/tesseract@b0ead95d64a366

Shreeshrii · 2017-07-25T12:07:16Z

@theraysmith

tesseract-ocr/tesseract@b0ead95 does not seem to solve this.
Does it also require your newer language models?

I ran replace-a-layer training with the fonts FreeSerif and FreeSans down to a 0.01% error rate. However, the model still seems to recognize the TM trademark sign as the letters TM and not as the sign, even when tested with the same tif that was used for training.

A zip file with the training text, synthetic training images, generated traineddata and OCR output with --oem 1 is attached.

eng.englayer.zip

tesseract -v
tesseract b0ead95
 leptonica-1.74.4
  libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.0 : libopenjp2 2.1.2

tesseract eng.FreeSerif.exp0.tif eng.FreeSerif.englayer -l englayer --oem 1 --psm 6 --tessdata-dir ../../tessdata

Shreeshrii · 2017-07-25T12:14:15Z

I notice that the unicharset still has TM as the normalized form instead of the sign itself.
Does Latin.unicharset need updating?

™ 0 63,201,209,255,101,273,0,59,104,293 Common 112 10 112 TM	# ™ [2122 ]
· 10 64,148,129,255,13,238,5,125,39,293 Common 113 10 113 ·	# · [b7 ]p
℠ 0 130,152,235,249,167,228,3,30,192,234 Common 114 10 114 SM	# ℠ [2120 ]
℗ 0 13,65,229,255,165,244,0,30,169,273 Common 115 10 115 ℗	# ℗ [2117 ]

theraysmith · 2017-07-25T15:29:33Z

No, there are still one or two commits to go before that will work. I might get them in today.

theraysmith · 2017-07-25T16:50:56Z

Right, try it now. You need commits b0ead95d..0e95e2ca, and 1a0f501..3e32be3 in langdata. I think they are everything you need. The new English model will contain TM.

Shreeshrii · 2017-07-26T03:58:34Z

Ray,
I updated langdata and tesseract and built tesseract again.

With the new traineddata, TM is not being recognized at all - it is getting dropped.

With eng.traineddata:

The trademark symbol (*), in Unicode U+2122 *~ trade mark sign (HTML &#8482; — &trade;),
\texttrademark in LaTeX,[1] [2] is a symbol used to indicate an assertion that the preceding mark

is a trademark. Registered trademarks are indicated using the registered trademark symbol (®),

With the new englayer.traineddata:

The trademark symbol (), in Unicode U+2122 " trade mark sign (HTML &#8482; · &trade;),
\texttrademark in LaTeX,[1] [2] is a symbol used to indicate an assertion that the preceding mark

is a trademark. Registered trademarks are indicated using the registered trademark symbol (®),
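
A quick way to compare outputs like the two above is to count the trade mark sign (U+2122) in the ground-truth text and in the OCR output. The Python sketch below assumes you have both as plain strings; `trademark_report` is a hypothetical helper, and the sample strings are shortened from the text quoted in this comment.

```python
TRADE_MARK_SIGN = "\u2122"


def trademark_report(ground_truth: str, ocr_output: str) -> dict:
    """Count U+2122 in the expected text and in the OCR result."""
    expected = ground_truth.count(TRADE_MARK_SIGN)
    recognized = ocr_output.count(TRADE_MARK_SIGN)
    return {
        "expected": expected,
        "recognized": recognized,
        # Signs that were dropped or replaced by other characters.
        "missing": max(expected - recognized, 0),
    }


if __name__ == "__main__":
    truth = "The trademark symbol (\u2122), in Unicode U+2122"
    ocr = "The trademark symbol (), in Unicode U+2122"
    print(trademark_report(truth, ocr))
```

This only counts the sign; it does not tell you whether a missing sign was dropped entirely or turned into the letters TM, so the raw output still needs a look.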

I used the old .lstmf files to do training - would that be a problem?

Shreeshrii · 2017-07-28T03:31:38Z

@theraysmith

I trained again after creating new box/tiff and lstmf files using the new code and new langdata.

TM sign is now being recognized correctly.

It is also NOT treating fl and fi as ligatures but as separate letters in words such as film, first, flounder, reflect etc.

Thanks!

eng.FreeSerif.engTM.txt

theraysmith · 2017-07-28T04:49:42Z

Great! That is the objective with the fi and fl ligatures. They now have a similar status to tatweel: used for rendering, but not for output. The difference, of course, is that fi and fl still produce output characters (as separate letters), whereas tatweel disappears completely.

amitdo mentioned this issue Mar 31, 2017

German Fraktur #59

Open

Shreeshrii mentioned this issue Mar 31, 2017

Trademark symbol and superscripts etc. tesseract-ocr/tesseract#761

Closed

Shreeshrii mentioned this issue Oct 7, 2019

Can't encode transcription tesseract-ocr/tesseract#2695

Closed
