4.0 Accuracy and Performance

jbarlow83 edited this page Jan 28, 2017 · 8 revisions

See testing section in https://github.com/tesseract-ocr/docs/blob/master/das_tutorial2016/7Building%20a%20Multi-Lingual%20OCR%20Engine.pdf for accuracy rates for different languages.

Big test in Google Data Center (Hindi?)

Engine Total char errors Word Recall Errors Word Precision Errors Walltime CPUtime*
Tess 3.04 13.9 30 31.2 3.0 2.8
Cube 15.1 29.5 30.7 3.4 3.1
Tess+Cube 11.0 24.2 25.4 5.7 5.3
LSTM 7.6 20.9 20.8 1.5 2.5

Note in the above table that LSTM is faster than Tess 3.04 (without adding cube) in both wall time and CPU time! For wall time by a factor of 2.

Median of three results from test on HP Z420 on a single Hindi page.

Test Mode Real User
Original (cube + tess) 7.6 7.3
Base Tess 2.9 2.6
Cube 5.4 4.9
LSTM With OpenMP+AVX 1.8 3.8
LSTM No OpenMP with AVX 2.7 2.4
LSTM No OpenMP with SSE 3.1 2.7
LSTM No OpenMP no SIMD at all 4.6 4.1

My first test with a simple screenshot gave significant better results with LSTM, but needed 16 minutes CPU time (instead of 9 seconds) with a debug build of Tesseract (-O0). A release build (-O2) needs 17 seconds with LSTM, 4 seconds without for the same image.

The slow speed with debug is to be expected. The new code is much more memory intensive, so it is a lot slower on debug (also openmp is turned off by choice on debug). The optimized build speed sounds about right for Latin-based languages. It is the complex scripts that will run faster relative to base Tesseract.

Ref: https://github.com/tesseract-ocr/tesseract/issues/40