Fix and enable lstm related unittests #2180

Shreeshrii · 2019-01-22T13:29:59Z

I will upload required testdata files to the test repo.

Shreeshrii · 2019-01-23T09:04:11Z

@stweil Please see the attached log file. The error rates for many tests are much lower than the expected values. I am wondering if it is related to using the Batch/Mean error as the Best error. Is this the same way error rates are calculated in tesseract?

Of course the difference could just be because the 'testdata' is different.

Fixed a merge conflict. Signed-off-by: Stefan Weil <sw@weilnetz.de>

stweil · 2019-01-23T14:53:02Z

I added a commit to fix a merge conflict with Git master.

stweil · 2019-01-23T14:59:31Z

unittest/log.h

      break;
    case ERROR:
-      std::cout << "[ERROR] ";
+      std::cout << "\n[ERROR] ";


@Shreeshrii, did you find the implementation which is used by Google, and does that implementation add line feeds like that?

Google might be using the implementation in glog - https://github.com/google/glog/blob/master/src/windows/glog/logging.h

I added the linefeed because I thought it might increase readability. It could probably be replaced by a space.

I see. The Google implementation adds a linefeed at the end if the log string does not already end with one. As Tesseract only has a few users of LOG, I think the linefeed characters can be added locally when calling LOG if needed. I suggest removing the 3rd commit, at least for now.
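The behaviour described above (append a linefeed at the end only when the message does not already end with one) can be sketched roughly as follows. This is a minimal illustration, not the glog or Tesseract code; the `LogMessage` class, its constructor parameters and the prefix strings are hypothetical.

```cpp
#include <cassert>
#include <iostream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for a LOG(severity) object.
// The message is buffered and written once, when the object is destroyed.
class LogMessage {
 public:
  explicit LogMessage(const std::string& prefix, std::ostream& out = std::cout)
      : out_(out) {
    buffer_ << prefix;
  }

  ~LogMessage() {
    std::string text = buffer_.str();
    // Add the trailing linefeed only if the caller did not supply one.
    if (text.empty() || text.back() != '\n') {
      text += '\n';
    }
    assert(!text.empty() && text.back() == '\n');
    out_ << text;
  }

  // Forward any streamable value into the buffer.
  template <typename T>
  LogMessage& operator<<(const T& value) {
    buffer_ << value;
    return *this;
  }

 private:
  std::ostream& out_;
  std::ostringstream buffer_;
};
```

With this sketch, `LogMessage("[ERROR] ") << "failed";` and `LogMessage("[ERROR] ") << "failed\n";` both print exactly one line, so no leading `"\n"` is needed in the prefix.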

@stweil I don't know how to remove the commit. I have added another commit reverting the change.

stweil · 2019-01-24T12:01:53Z

unittest/include_gunit.h

@@ -17,7 +17,7 @@
 #include "fileio.h"   // for tesseract::File
 #include "gtest/gtest.h"

-const char* FLAGS_test_tmpdir = ".";
+const char* FLAGS_test_tmpdir = "./tmp";


That directory is missing for builds which are not started in the root directory, so a lot of tests fail or crash currently. Do we need this change?

When I ran make check in the tesseract root directory, all files generated by the unittests were created in the unittest root directory, and make clean did not remove them. So I thought it would be helpful to have a separate directory for the generated files.

There may be a better way to accomplish this. Please change as you see fit. Thanks.
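One possible way to keep a separate directory without breaking builds started elsewhere is to create it on demand before any test writes to it. The sketch below assumes a C++17 compiler with `std::filesystem`; `EnsureTestTmpDir` is a hypothetical helper and not part of the Tesseract unittest sources.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <system_error>

// Hypothetical helper: creates `dir` (and any missing parents) if needed.
// Returns true when the directory exists afterwards, false on any error.
bool EnsureTestTmpDir(const std::string& dir) {
  assert(!dir.empty());
  std::error_code ec;
  // Does nothing (and reports no error) if the directory already exists.
  std::filesystem::create_directories(dir, ec);
  if (ec) {
    return false;
  }
  return std::filesystem::is_directory(dir, ec) && !ec;
}
```

It could be called with `FLAGS_test_tmpdir` once before the tests run, for example from a test fixture's `SetUp()`. Whether that fits the existing unittest setup is an open question.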

Shreeshrii · 2019-01-24T13:15:40Z

When built with --enable-openmp

[       OK ] LSTMTrainerTest.TestSquashed (121074 ms)
[----------] 1 test from LSTMTrainerTest (121074 ms total)

With --disable-openmp

[       OK ] LSTMTrainerTest.TestSquashed (250335 ms)
[----------] 1 test from LSTMTrainerTest (250335 ms total)
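For reference, the ratio between the two timings above can be computed with a small helper. `Speedup` is a hypothetical name used only for this illustration.

```cpp
#include <cassert>

// Hypothetical helper: ratio of the slower wall time to the faster one.
double Speedup(double baseline_ms, double improved_ms) {
  assert(improved_ms > 0.0);
  return baseline_ms / improved_ms;
}
```

`Speedup(250335, 121074)` is roughly 2.07, so on this machine the OpenMP build ran `LSTMTrainerTest.TestSquashed` about twice as fast as the build without OpenMP.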

Shreeshrii · 2019-01-29T04:48:15Z

@stweil Ref: #2180 (comment)

Did you have a chance to look into this?

I reran tesstutorial today. According to Ray in https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#training-from-scratch:

The character error rate falls below 50% just after 3700 iterations, and by 5000 to about 13%, where it will terminate. (In about 20 minutes on a current high-end machine with AVX.)

In my test run today, the error rates were much lower at the same iteration counts:

At iteration 3684/3700/3700, Mean rms=1.506%, delta=3.148%, char train=10.694%, word train=25.578%, skip ratio=0%,  New best char error = 10.694 wrote best model:/home/ubuntu/tesstutorial/engoutput/base10.694_3684.checkpoint wrote checkpoint.

At iteration 4764/5000/5000, Mean rms=0.902%, delta=1.113%, char train=3.729%, word train=9.967%, skip ratio=0%,  New best char error = 3.729 wrote best model:/home/ubuntu/tesstutorial/engoutput/base3.729_4764.checkpoint wrote checkpoint.

Shreeshrii added 3 commits January 22, 2019 13:25

Fix and build lstm related unittests

f6501d7

Use ./tmp instead of ./ for files created by unittests

94747c0

Add linefeed before LOG messages

1ce9152

Shreeshrii changed the title ~~Fix and build lstm related unittests~~ Fix and enable lstm related unittests Jan 22, 2019

ghost assigned stweil Jan 23, 2019

ghost added the review label Jan 23, 2019

Merge remote-tracking branch 'tesseract-ocr/master' into master

9d10c95

Fixed a merge conflict. Signed-off-by: Stefan Weil <sw@weilnetz.de>

stweil reviewed Jan 23, 2019

revert the change adding linefeed

acb661b

stweil merged commit bbd23bb into tesseract-ocr:master Jan 24, 2019

ghost removed the review label Jan 24, 2019

stweil reviewed Jan 24, 2019
