Cube and combined modes doesn't work in 3.03 #40

ws233 · 2015-06-29T10:32:59Z

What steps will reproduce the problem?

Try to recognize the attached image with Cube mode. Whitelist is '0123456789'. The result is wrong. It's 1234669890 (6 instead of 5, 9 instead of 7).
Try to recognize the same image with Combined mode. There is a crach with the following error:
init_cube_objects(true, &tessdata_manager):Error:Assert failed:in file tessedit.cpp, line 203
[1] 5562 abort tesseract image_sample.jpg stdout -l rus rus-test
It seems that it happens with eng locale as well as with the rus loc.

What is the expected output? What do you see instead?
In both cases the output should be 1234567890.

What version of the product are you using? On what operating system?
I've tried tesseract 3.03 both on mac and iOS.

Please provide any additional information below.
There is a related thread in the Tesseract-OCR-iOS wrapper, where the issue was originally found: gali8/Tesseract-OCR-iOS#140. You may ask for any additional info there.

zdenop · 2015-06-29T11:26:25Z

According Ray cube is a dead-end and it can be removed soon from the code(see e.g. [1]), so you can not expect any progress...

[1] https://groups.google.com/d/msg/tesseract-dev/mtIJXoUpfJc/6f0EwVNXOM8J

tfmorris · 2016-02-17T18:29:02Z

@zdenop Since Cube is going away, perhaps this can be closed?

zdenop · 2016-02-17T18:47:43Z

lets wait for Ray...

amitdo · 2016-05-28T20:17:56Z

It's 'going away' for several years now... :-)

ghost · 2016-07-02T12:05:52Z

How Cube is being discontinued, it's training procedure has not been published to the public.
Somehow I got the feeling that Cube was purposely sabotaged and hindered from the public.

amitdo · 2016-11-23T10:18:50Z

The new LSTM based engine is here.

@theraysmith, I see that the Cube engine is still present in the code. Are you going to drop it in the final 4.0 release?

theraysmith · 2016-11-26T02:29:28Z

Actually I have a comment for this.
There is one reason why cube has survived this long: For Hindi cube+tesseract has half the error rate of either on their own. I haven't actually tested that against the new LSTM engine yet, but I will on Monday, and if the new LSTM engine is better, then yes, cube is likely to get the chop for 4.00, and the ifdefs will be very useful.

Shreeshrii · 2016-11-26T05:40:02Z

@theraysmith

Since the hardware requirements for 4.0 are going to be higher than for 3.xx versions, it will be good to keep Hindi cube+tesseract version also available.

The accuracy results that you are mentioning for Hindi are for which version - 3.02, 3.03, 3.04 ?

theraysmith · 2016-11-28T18:13:32Z

Tests complete. Decision made. Cube is going away in 4.00.
Results:

Engine	Total char errors	Word Recall Errors	Word Precision Errors	Walltime	CPUtime*
Tess 3.04	13.9	30	31.2	3.0	2.8
Cube	15.1	29.5	30.7	3.4	3.1
Tess+Cube	11.0	24.2	25.4	5.7	5.3
LSTM	7.6	20.9	20.8	1.5	2.5

Note in the above table that LSTM is faster than Tess 3.04 (without adding cube) in both wall time and CPU time! For wall time by a factor of 2.

zdenop · 2016-11-28T18:29:11Z

Can you provide some details about used hardware for test?
Did you made test also on single core CPU to see difference?

stweil · 2016-11-28T18:40:41Z

And what about the language model used for the test? Is it already available so I can use it for my own tests?

theraysmith · 2016-11-28T19:05:06Z

OK, the big test I ran in a Google data center. I just ran a test on my machine (HP Z420) on a single Hindi page for comparison, ran each one 3 times (using time), and took the median. My machine has AVX, so that will have still speeded it up a bit, so I tried without AVX/SSE as well: I disabled OpenMP by adding #undef _OPENMP in functions.h, line 33, and disabled AVX/SSE in weightmatrix.cpp, line 66,67. Test Mode Real User Default (-oem 3 = cube + tess) 7.6 7.3 Base Tess (-oem 0) 2.9 2.6 Cube (-oem 1) 5.4 4.9 LSTM With OpenMP+AVX 1.8 3.8 LSTM No OpenMP with AVX 2.7 2.4 LSTM No OpenMP with SSE 3.1 2.7 LSTM No OpenMP no SIMD at all 4.6 4.1 I think these tests nail cube as being slower and less accurate. There may be a debate as to the value of the old Tesseract engine for its speed vs the new one for its accuracy. I'm going to push the data files now.

…

On Mon, Nov 28, 2016 at 10:40 AM, Stefan Weil ***@***.***> wrote: And what about the language model used for the test? Is it already available so I can use it for my own tests? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#40 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AL056W2jegBHZM59ZxH-U5q22tSBy-HZks5rCyAxgaJpZM4FOBFi> .

-- Ray.

stweil · 2016-11-28T21:49:01Z

I'm going to push the data files now.

Got the first ones. My first test with a simple screenshot gave significant better results with LSTM, but needed 16 minutes CPU time (instead of 9 seconds) with a debug build of Tesseract (-O0). A release build (-O2) needs 17 seconds with LSTM, 4 seconds without for the same image.

Are there also new data files planned for old German (deu_frak)? I was surprised that the default English model with LSTM could recognize some words.

theraysmith · 2016-11-28T23:47:56Z

On Mon, Nov 28, 2016 at 1:49 PM, Stefan Weil ***@***.***> wrote: I'm going to push the data files now. Got the first ones. My first test with a simple screenshot gave significant better results with LSTM, but needed 16 minutes CPU time (instead of 9 seconds) with a debug build of Tesseract (-O0). A release build (-O2) needs 17 seconds with LSTM, 4 seconds without for the same image.

The slow speed with debug is to be expected. The new code is much more memory intensive, so it is a lot slower on debug (also openmp is turned off by choice on debug). The optimized build speed sounds about right for Latin-based languages. It is the complex scripts that will run faster relative to base Tesseract.

Are there also new data files planned for old German (deu_frak)? I was surprised that the default English model with LSTM could recognize some words.

I don't think I generated the original deu_frak. I have the fonts to do so with LSTM, but I don't know if I have a decent amount of corpus data to hand. With English at least, the language was different in the days of Fraktur (Ye Olde shoppe). I know German continued to be written in Fraktur until the 1940s, so that might be easier. Or is there an old German that is analogous to Ye Old Shoppe for English?

…

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#40 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AL056Ti1gWSSG6BfuBbL68EE7RYfsItOks5rC0xWgaJpZM4FOBFi> .

-- Ray.

Shreeshrii · 2016-11-29T04:14:35Z

1. Is there a 3.04 vs 4.0 branch in tessdata for the traineddata files? 2. Stefan, please share the binaries for 4.0 alpha for Windows. I am interested in trying the hindi and other indian languages traineddata. Thanks.

…

On 29-Nov-2016 5:18 AM, "theraysmith" ***@***.***> wrote: On Mon, Nov 28, 2016 at 1:49 PM, Stefan Weil ***@***.***> wrote: > I'm going to push the data files now. > > Got the first ones. My first test with a simple screenshot gave > significant better results with LSTM, but needed 16 minutes CPU time > (instead of 9 seconds) with a debug build of Tesseract (-O0). A release > build (-O2) needs 17 seconds with LSTM, 4 seconds without for the same > image. > The slow speed with debug is to be expected. The new code is much more memory intensive, so it is a lot slower on debug (also openmp is turned off by choice on debug). The optimized build speed sounds about right for Latin-based languages. It is the complex scripts that will run faster relative to base Tesseract. > Are there also new data files planned for old German (deu_frak)? I was > surprised that the default English model with LSTM could recognize some > words. > I don't think I generated the original deu_frak. I have the fonts to do so with LSTM, but I don't know if I have a decent amount of corpus data to hand. With English at least, the language was different in the days of Fraktur (Ye Olde shoppe). I know German continued to be written in Fraktur until the 1940s, so that might be easier. Or is there an old German that is analogous to Ye Old Shoppe for English? > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <#40# issuecomment-263405208>, > or mute the thread > <https://github.com/notifications/unsubscribe-auth/ AL056Ti1gWSSG6BfuBbL68EE7RYfsItOks5rC0xWgaJpZM4FOBFi> > . > -- Ray. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#40 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AE2_ox9UBjscGCZrZM-bSZ-Eimw91bkXks5rC2g5gaJpZM4FOBFi> .

stweil · 2016-11-29T07:04:43Z

I know German continued to be written in Fraktur until the 1940s, so that might be easier. Or is there an old German that is analogous to Ye Old Shoppe for English?

Fraktur was used for an important German newspaper (Reichsanzeiger) until 1945. I'd like to try some pages from that newspaper with Tesseract LSTM. Surprisingly even with the English data Tesseract was able to recognize at least some words written in Fraktur. Could you give me some hints how to create the data for deu_frak?

There is an Old High German (similar to Old English), but the German translation of the New Testament by Martin Luther (1521) was one of the first major printed books in German, and basically it started the modern German language (High German) which is used until today.

zdenop · 2016-11-29T07:47:04Z

I think it would be great to move this discussion to (developers) forum. we are already out scope of original issue post and much more people should be interested in "Faktur topic"...

stweil · 2016-11-29T14:00:28Z

Stefan, please share the binaries for 4.0 alpha for Windows.

@Shreeshrii, they are online now at the usual location. See also the related pull request #511. Please report results either in the developer forum as suggested by @zdenop or by personal mail to me.

amitdo · 2016-11-29T14:00:56Z

Is there a 3.04 vs 4.0 branch in tessdata for the traineddata files?

https://github.com/tesseract-ocr/tessdata/tree/3.04.00

Shreeshrii · 2016-11-29T15:25:33Z

Thanks, Amit. Please add the info to the wiki also, if you have not already done so.

…

On 29-Nov-2016 7:31 PM, "Amit D." ***@***.***> wrote: Is there a 3.04 vs 4.0 branch in tessdata for the traineddata files? https://github.com/tesseract-ocr/tessdata/tree/3.04.00 — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#40 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AE2_o3a5AJ3je8sWymCBCFk_0jhhM1d9ks5rDDAegaJpZM4FOBFi> .

Shreeshrii · 2016-11-29T15:27:25Z

Thanks, I will give it a try and report back.

…

On 29-Nov-2016 7:30 PM, "Stefan Weil" ***@***.***> wrote: Stefan, please share the binaries for 4.0 alpha for Windows. @Shreeshrii <https://github.com/Shreeshrii>, they are online now at the usual location. Please report results either in the developer forum as suggested by @zdenop <https://github.com/zdenop> or by personal mail to me. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#40 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AE2_o5J0hvvS7W2el0QDzXNO957qHvumks5rDDABgaJpZM4FOBFi> .

amitdo · 2016-11-29T15:32:41Z

Amit. Please add the info to the wiki also, if you have not already done so.

You can do it yourself... :)

Shreeshrii · 2016-11-29T18:36:21Z

@theraysmith @stweil

Thank you! I tested a few devanagari pages with the 4.0 alpha windows binaries and traineddata for Hindi, Sanskrit, Marathi and Nepali. This was on a Windows 10 netbook, intel atom 1.33 ghz cpu, x64 based processor, 32 bit os, 2 GB RAM. I tested only with single page images and there was no performance problem on this basic netbook. The accuracy is much improved in the LSTM version. This is by just eyeballing the output (not using any software for comparison).

From a user point of view, better accuracy maybe preferred to speed. So LSTM based engine seems the way to go, at least for devanagari scripts. I will test some of the other Indian languages later.

I have noticed some differences in processing between Hindi and the other Devanagari based languages and will add issues to the tessdata repository.

Thanks to the developers at Google and the tesseract community!

jbaiter · 2016-11-29T22:04:52Z

@theraysmith

I don't think I generated the original deu_frak. I have the fonts to do so with LSTM, but I don't know if I have a decent amount of corpus data to hand.

I have a decent amount of corpus data for Fraktur from scanned books at hand, about 500k lines in hOCR files (~50GB with TIF images). I've yet to publish it, but if you have somewhere where I could send/upload it, I'd be glad to.

Or is there a way to create the neccessary training files myself? I've had a cursory look through the OCR code and it looked like it needed lstmf files, but I haven't yet found what these are supposed to look like.

theraysmith · 2016-11-29T22:21:16Z

I have a new training md file in prep with an update to the code to make it all work correctly. It is going through our review process, and then I will need to sync again with the changes that have happened since my last sync, but it should be available late this week. The md file documents the training process in tutorial detail, but line boxes and transcriptions sounds perfect! 500k lines should make it work really well. I would be happy to take it and help you, but we would have to get into licenses, copyright and all that first. For now it might be best to hang on for the instructions.

…

On Tue, Nov 29, 2016 at 2:05 PM, Johannes Baiter ***@***.***> wrote: @theraysmith <https://github.com/theraysmith> I don't think I generated the original deu_frak. I have the fonts to do so with LSTM, but I don't know if I have a decent amount of corpus data to hand. I have a decent amount of corpus data for Fraktur (from scanned books) at hand (hOCR files with line boxes and transcriptions), it's about 500k lines and 50GB. I've yet to publish it, but if you have somewhere where I could send/upload it, I'd be glad to. Or is there a way to create the neccessary training files myself? I've had a cursory look through the OCR code and it looked like it needed lstmf files, but I haven't yet found what these are supposed to look like. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#40 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AL056WSh7RpK3l-EHGgky_xbByYqhbwSks5rDKGOgaJpZM4FOBFi> .

-- Ray.

jbaiter · 2016-11-29T22:30:11Z

500k lines should make it work really well. I would be happy to take it and help you, but we would have to get into licenses, copyright and all that first.

The text is CC0 and the images are CC-BY-NC, so that shouldn't be an issue :-) They're going to be public anyway once I've prepped the dataset for publication.
But even better if there are instructions, looking forward to playing around with training!

Shreeshrii · 2016-11-30T11:14:18Z

Ray, Please see my recent comment and attached files in tesseract-ocr/tessdata#6 Adding config files to trained data for san, mar and nep will fix this issue related to skipped text with default psm. I made a copy of hin.config and changed the default engine to oem 4, LSTM. I also removed the blacklisting of 1, since Indo-arabic numbers in Latin scripts are used quite commonly with Devanagari script text. There are various other Devanagari related options in the config file, which can be removed, if not needed with LSTM. Thanks.

theraysmith · 2016-12-14T19:40:59Z

Cube is gone! Removal completed as of 9d6e4f6

amitdo · 2016-12-14T20:01:52Z

Sad news: Cube is no longer with us.

Cube, you will be missed...

Shreeshrii · 2017-01-19T08:46:57Z

@jbaiter have you tried 4.0 training for Fraktur?

@theraysmith Is there a way to use the old box-tiff pairs at https://github.com/paalberti/tesseract-dan-fraktur for LSTM training?

Also see tesseract related issue at paalberti/tesseract-dan-fraktur#3

amitdo · 2017-01-19T13:34:39Z

Is there a way to use the old box-tiff pairs at https://github.com/paalberti/tesseract-dan-fraktur for LSTM training?

There will be a way to generate a box file from a tiff image. The box file will be written in the textline format
#659 (comment)
I started working on this today. I wrote the needed code and It seems to output the desired format, but I need to do some tests before publishing it.

Shreeshrii · 2017-01-19T14:24:30Z

@amitdo Not sure if that will work for Devanagari, because of the length of unicode string.

Is it possible to just add a box with the tab character at end of each line for existing box files?

amitdo · 2017-01-19T15:02:04Z

Not sure if that will work for Devanagari, because of the length of unicode string.

We will wait and see...

Is it possible to just add a box with the tab character at end of each line for existing box files?

You mean manually?
You should add box coordinates, not just tab.

ws233 mentioned this issue Jul 21, 2015

numbers not properly recognised gali8/Tesseract-OCR-iOS#200

Closed

amitdo added the bug label Jun 5, 2016

zdenop added the wontfix label Sep 1, 2016

zdenop closed this as completed Nov 24, 2016

amitdo mentioned this issue Nov 28, 2016

Arabic language (right to left in writing) stored (left to right) after create PDF Searchable #238

Open

Shreeshrii mentioned this issue Dec 2, 2016

4.0 alpha: traineddata size tesseract-ocr/tessdata#31

Closed

This was referenced Dec 31, 2016

LSTM vs BLSTM vs MDLSTM #630

Closed

OEM_LSTM_ONLY very slow #632

Closed

amitdo mentioned this issue Mar 13, 2017

German Fraktur tesseract-ocr/langdata#59

Open

amitdo mentioned this issue May 11, 2017

[Clarification request/Improvement] Add version and github info always, also when --enable-debug was missing #723

Closed

afolarin mentioned this issue Jul 8, 2017

Test LSTM OCR Engine in Tesseract CogStack/CogStack-Pipeline#30

Closed

nijanthan0 mentioned this issue Apr 22, 2019

compute ctc target failed #2395

Open

M3ssman mentioned this issue Jan 8, 2020

Autoconf Failure for 4.1.1-rc1 #2851

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cube and combined modes doesn't work in 3.03 #40

Cube and combined modes doesn't work in 3.03 #40

ws233 commented Jun 29, 2015

zdenop commented Jun 29, 2015

tfmorris commented Feb 17, 2016

zdenop commented Feb 17, 2016

amitdo commented May 28, 2016

ghost commented Jul 2, 2016

amitdo commented Nov 23, 2016

theraysmith commented Nov 26, 2016

Shreeshrii commented Nov 26, 2016

theraysmith commented Nov 28, 2016

zdenop commented Nov 28, 2016

stweil commented Nov 28, 2016

theraysmith commented Nov 28, 2016 via email

stweil commented Nov 28, 2016

theraysmith commented Nov 28, 2016 via email

Shreeshrii commented Nov 29, 2016 via email

stweil commented Nov 29, 2016

zdenop commented Nov 29, 2016

stweil commented Nov 29, 2016 •

edited

amitdo commented Nov 29, 2016

Shreeshrii commented Nov 29, 2016 via email

Shreeshrii commented Nov 29, 2016 via email

amitdo commented Nov 29, 2016

Shreeshrii commented Nov 29, 2016

jbaiter commented Nov 29, 2016 •

edited

theraysmith commented Nov 29, 2016 via email

jbaiter commented Nov 29, 2016

Shreeshrii commented Nov 30, 2016 via email

theraysmith commented Dec 14, 2016

amitdo commented Dec 14, 2016 •

edited

Shreeshrii commented Jan 19, 2017

amitdo commented Jan 19, 2017 •

edited

Shreeshrii commented Jan 19, 2017

amitdo commented Jan 19, 2017

Cube and combined modes doesn't work in 3.03 #40

Cube and combined modes doesn't work in 3.03 #40

Comments

ws233 commented Jun 29, 2015

zdenop commented Jun 29, 2015

tfmorris commented Feb 17, 2016

zdenop commented Feb 17, 2016

amitdo commented May 28, 2016

ghost commented Jul 2, 2016

amitdo commented Nov 23, 2016

theraysmith commented Nov 26, 2016

Shreeshrii commented Nov 26, 2016

theraysmith commented Nov 28, 2016

zdenop commented Nov 28, 2016

stweil commented Nov 28, 2016

theraysmith commented Nov 28, 2016 via email

stweil commented Nov 28, 2016

theraysmith commented Nov 28, 2016 via email

Shreeshrii commented Nov 29, 2016 via email

stweil commented Nov 29, 2016

zdenop commented Nov 29, 2016

stweil commented Nov 29, 2016 • edited

amitdo commented Nov 29, 2016

Shreeshrii commented Nov 29, 2016 via email

Shreeshrii commented Nov 29, 2016 via email

amitdo commented Nov 29, 2016

Shreeshrii commented Nov 29, 2016

jbaiter commented Nov 29, 2016 • edited

theraysmith commented Nov 29, 2016 via email

jbaiter commented Nov 29, 2016

Shreeshrii commented Nov 30, 2016 via email

theraysmith commented Dec 14, 2016

amitdo commented Dec 14, 2016 • edited

Shreeshrii commented Jan 19, 2017

amitdo commented Jan 19, 2017 • edited

Shreeshrii commented Jan 19, 2017

amitdo commented Jan 19, 2017

stweil commented Nov 29, 2016 •

edited

jbaiter commented Nov 29, 2016 •

edited

amitdo commented Dec 14, 2016 •

edited

amitdo commented Jan 19, 2017 •

edited