Tesseract 4 and 5 is about 100-150 times slower than 3 on my Linux system. #2611

Closed
ripefig opened this issue Aug 11, 2019 · 52 comments

@ripefig

ripefig commented Aug 11, 2019

Environment

  • Tesseract Version:

> tesseract -v

tesseract 4.0.0
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE

> tesseract-snap -v

tesseract 5.0.0-alpha-335-gae02
 leptonica-1.74.2
  libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8
 Found AVX2
 Found AVX
 Found FMA
 Found SSE

tesseract-ocr-eng : 1:4.00~git30-7274cfa-1

I used the training data from the Ubuntu repos for both tesseract and tesseract-snap, since no data is provided with the snap.

  • Platform:
    Operating System: Kubuntu 19.04
    KDE Plasma Version: 5.15.4
    KDE Frameworks Version: 5.56.0
    Qt Version: 5.12.2
    Kernel Version: 5.0.0-21-generic
    OS Type: 64-bit
    Processors: 4 × Intel® Core™ i7-4600U CPU @ 2.10GHz
    Memory: 11.6 GiB of RAM

Current Behavior:

It takes over a minute at 100% CPU load to scan an image (directly below) containing two sentences:

62771160-3bf82880-baa5-11e9-8d39-4d9c4381a093

results for tesseract 4:
> time tesseract -l eng 62771160-3bf82880-baa5-11e9-8d39-4d9c4381a093.png 1

Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143

real    1m9.096s
user    3m7.484s
sys     0m0.335s

Tesseract 5:

> time tesseract-snap -l eng 62771160-3bf82880-baa5-11e9-8d39-4d9c4381a093.png 1

Tesseract Open Source OCR Engine v5.0.0-alpha-335-gae02 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143

real    1m13.585s
user    3m16.104s

I tried to OCR a one-page document, but I had to kill the process; it would probably have taken an hour of full CPU load. Unfortunately I don't have Tesseract 3 to compare against, but I remember using it in an OCR screenshotting script, where it felt as fast as regular copy and paste, so definitely under two seconds for this block of text.

Expected Behavior:

It shouldn't take this long to scan two sentences.

Suggested Fix

Disable multithreading by default until it's fixed.

@ripefig ripefig changed the title Tesseract 4 and 5 is about 200 times slower than 3 on my Linux system. Tesseract 4 and 5 is about 100-200 times slower than 3 on my Linux system. Aug 12, 2019
@ripefig ripefig changed the title Tesseract 4 and 5 is about 100-200 times slower than 3 on my Linux system. Tesseract 4 and 5 is about 100-150 times slower than 3 on my Linux system. Aug 12, 2019
@ripefig
Author

ripefig commented Aug 12, 2019

The solution is to set OMP_THREAD_LIMIT=1.
Shouldn't multithreading be disabled by default until it's fixed?

#898

@stweil
Contributor

stweil commented Aug 12, 2019

I cannot reproduce your timing results on a recent Debian system:

$ tesseract --version
tesseract 4.0.0
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE
$ time tesseract -l eng 62841051-2b65cf00-bcac-11e9-97df-bf85ff0b09bf.png 1
Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143

real	0m0,209s
user	0m0,497s
sys	0m0,024s

With OMP_THREAD_LIMIT=1, it takes a little longer:

$ time tesseract -l eng 62841051-2b65cf00-bcac-11e9-97df-bf85ff0b09bf.png 1
Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143

real	0m0,255s
user	0m0,247s
sys	0m0,008s

@Shreeshrii
Collaborator

Is the test image available somewhere? I would like to try it on a non-AVX system.


@stweil
Contributor

stweil commented Aug 12, 2019

Is the test image available somewhere? I would like to try it on a non-AVX system.

It's given in the initial report: https://user-images.githubusercontent.com/45201036/62841051-2b65cf00-bcac-11e9-97df-bf85ff0b09bf.png.

Even without AVX it should not take more than a second.

@Shreeshrii
Collaborator

Thanks @stweil .

Here are the results on my system - Linux tesseract-ocr 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:54:50 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux

ubuntu@tesseract-ocr:~/TEST$  time tesseract -l eng 62841051-2b65cf00-bcac-11e9-97df-bf85ff0b09bf.png 1
Tesseract Open Source OCR Engine v5.0.0-alpha-332-gb839 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143

real    0m15.212s
user    0m9.774s
sys     0m0.186s

ubuntu@tesseract-ocr:~/TEST$ OMP_THREAD_LIMIT=1 time tesseract -l eng 62841051-2b65cf00-bcac-11e9-97df-bf85ff0b09bf.png 1
Tesseract Open Source OCR Engine v5.0.0-alpha-332-gb839 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143
3.09user 0.02system 0:03.11elapsed 99%CPU (0avgtext+0avgdata 88064maxresident)k
0inputs+128outputs (0major+2175minor)pagefaults 0swaps
ubuntu@tesseract-ocr:~/TEST$ tesseract -v
tesseract 5.0.0-alpha-332-gb839
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.4.4 : libopenjp2 2.3.0

IBM POWER8 - 8 CPU, 24 GB RAM

@ripefig
Author

ripefig commented Aug 12, 2019

@Shreeshrii seems like you have the same problem, given that your system is much more powerful.

@stweil
Contributor

stweil commented Aug 13, 2019

Test result on an ARM system:

# tessdata_fast
real	0m1.864s
user	0m4.778s
sys	0m0.147s

With export OMP_THREAD_LIMIT=1:

# tessdata_fast
real	0m2.078s
user	0m1.950s
sys	0m0.099s

The results for 4.0.0 and latest Git master are similar.

@stweil
Contributor

stweil commented Aug 13, 2019

@ripefig, your results could be explained if Tesseract cannot get 4 CPU cores. On my ARM system which has 4 cores I get a faster result with export OMP_THREAD_LIMIT=2:

# tessdata_fast
real	0m1.426s
user	0m1.969s
sys	0m0.159s

That also reduces the huge overhead in the user time which occurs with 4 threads.

@ripefig
Author

ripefig commented Aug 13, 2019

$ time OMP_THREAD_LIMIT=1  tesseract -l eng  62771160-3bf82880-baa5-11e9-8d39-4d9c4381a093.png 1
Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143

real    0m0.366s
user    0m0.346s
sys     0m0.012s

$ time OMP_THREAD_LIMIT=3  tesseract -l eng  62771160-3bf82880-baa5-11e9-8d39-4d9c4381a093.png 1
Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143

real    0m1.933s
user    0m3.652s
sys     0m0.037s


$ time OMP_THREAD_LIMIT=2  tesseract -l eng  62771160-3bf82880-baa5-11e9-8d39-4d9c4381a093.png 1
Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143

real    0m0.732s
user    0m0.757s
sys     0m0.032s

@ripefig
Author

ripefig commented Aug 15, 2019

@stweil Is there any solution? Maybe limit the default number of cores to 1 (or max cores - 1) until Tesseract can reliably work with all cores? It seems to be completely broken for a lot of users, and the problem has persisted for years. This also breaks all the software that uses Tesseract.

@stweil
Contributor

stweil commented Aug 15, 2019

The right solution depends on your hardware (number of cores, memory interface) and your use case: on some hardware using more than one core results in faster OCR (see my results above), and training is much faster with 4 cores. It is always possible to either set OMP_THREAD_LIMIT or to build your own binary without multithreading. Without that, Tesseract is not "completely broken" or unreliable, but simply slow. I know that is not nice. The Windows binaries from UB Mannheim are therefore built without multithreading.

Because there are acceptable solutions for the speed issue, my current first priority is improving quality, not improving multithreading. If you or someone else finds a better solution for multithreading, a pull request would be welcome.

@ripefig
Author

ripefig commented Aug 15, 2019

Out of the box, it takes about one hour to OCR a single page of text. It would take one month to OCR a textbook, and the CPU would probably fry. I think most users would consider this "completely broken," in the sense of not being usable.

The issue affects both AVX and non-AVX systems. The program is capable of cutting down times by two orders of magnitude in both cases, as demonstrated in this thread. Why not just limit the core count by default until the issue is fixed?

Of course, one could argue that it's up to application developers to make sure tesseract works on the target system. (I just tried a few OCR apps and most of them work fine - so it looks like they are fixing it on their end somehow).

@stweil
Contributor

stweil commented Aug 15, 2019

That's simply not true. It is slow on your notebook. On my six-year-old notebook (no AVX, 4 × Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz) the official Debian package works pretty well:

$ time tesseract 62841051-2b65cf00-bcac-11e9-97df-bf85ff0b09bf.png -
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143

real	0m0.327s
user	0m0.796s
sys	0m0.024s

@ripefig
Author

ripefig commented Aug 15, 2019

I didn't say it affects all systems, but it's frequent enough to warrant some kind of change. Multicore might give a 5-30% speed improvement in certain cases, but it can also make things 100 times slower on many systems. The Intel® Core™ i7-4600U isn't exactly an exotic chip.

Perhaps you're saying this is an Ubuntu issue?

jbarlow83 pushed a commit to ocrmypdf/OCRmyPDF that referenced this issue Oct 20, 2019
Based on a user suggestion and
tesseract-ocr/tesseract#2611, I reviewed thread limits and found that a
thread limit of 3 is still beneficial, but 4 is not.

> time env OMP_THREAD_LIMIT=2 tesseract omp4.png stdout >/dev/null
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143
116.67user 1.67system 1:26.26elapsed 137%CPU (0avgtext+0avgdata 356752maxresident)k
2213inputs+0outputs (18major+131059minor)pagefaults 0swaps
> time env OMP_THREAD_LIMIT=3 tesseract omp4.png stdout >/dev/null
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143
136.89user 1.63system 1:19.56elapsed 174%CPU (0avgtext+0avgdata 356784maxresident)k
821inputs+0outputs (0major+131080minor)pagefaults 0swaps
> time env OMP_THREAD_LIMIT=4 tesseract omp4.png stdout >/dev/null
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143
161.31user 1.51system 1:18.80elapsed 206%CPU (0avgtext+0avgdata 356632maxresident)k
8477inputs+0outputs (12major+131074minor)pagefaults 0swaps
> time env OMP_THREAD_LIMIT=8 tesseract omp4.png stdout >/dev/null
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143
160.30user 1.62system 1:18.01elapsed 207%CPU (0avgtext+0avgdata 356640maxresident)k
821inputs+0outputs (0major+131078minor)pagefaults 0swaps
@amitdo
Collaborator

amitdo commented Oct 21, 2019

Processors: 4 × Intel® Core™ i7-4600U CPU @ 2.10GHz

https://ark.intel.com/content/www/us/en/ark/products/76616/intel-core-i7-4600u-processor-4m-cache-up-to-3-30-ghz.html

# of Cores 2

@dagnelies

Indeed, that whole multithreading approach has caused more harm than good. There are a few open issues about this. I believe some people even compile specialized builds with OpenMP completely removed, since they run about as fast but with far less CPU consumption.

@zdenop
Contributor

zdenop commented Oct 30, 2019

closing as duplicate to #263

@Shreeshrii
Collaborator

@stweil I want to compare the timing on Power8 to AVX2. I notice that the results you reported were with tesseract 4.0.0. Please rerun the test with the latest code.

My current result is:

 time tesseract -l eng 62841051-2b65cf00-bcac-11e9-97df-bf85ff0b09bf.png 1
Tesseract Open Source OCR Engine v5.0.0-alpha-537-g6f31 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143

real    0m1.109s
user    0m3.524s
sys     0m0.032s

@stweil
Contributor

stweil commented Nov 17, 2019

@Shreeshrii, my results for Power8 differ significantly when I use tessdata_fast. Did you test with tessdata_best? Power8 can be improved a lot by using SIMD. For tessdata_best that is easy to implement.

Intermediate results (more results will get added later) with git master:

# Power8, fast, configure (default options)
real	0m0.802s
user	0m1.815s
sys	0m0.030s

# Power8, fast, configure --disable-openmp --disable-shared
real	0m1.243s
user	0m1.231s
sys	0m0.012s

# Power8, best, configure (default options)
real	0m1.329s
user	0m3.804s
sys	0m0.031s

# Power8, best, configure (default options), OMP_THREAD_LIMIT=1
real	0m3.155s
user	0m3.139s
sys	0m0.019s

# Power8, best, configure (default options), SIMD
real	0m1.144s
user	0m2.748s
sys	0m0.045s

# Power8, best, configure (default options), SIMD, OMP_THREAD_LIMIT=1
real	0m1.858s
user	0m1.842s
sys	0m0.019s

# Power8, best, configure --disable-openmp --disable-shared
real	0m2.981s
user	0m2.957s
sys	0m0.024s

# Power8, best, configure --disable-openmp --disable-shared, SIMD
real	0m1.686s
user	0m1.669s
sys	0m0.016s

@stweil
Contributor

stweil commented Nov 18, 2019

Do any compile options also need to be changed?

Of course it needs OpenMP (otherwise the compiler will raise an error with the current patch), so -fopenmp is required for that file even if OpenMP was disabled for the rest of the code. I also added -maltivec and -mabi=altivec in my test. Maybe -mcpu=native could be added for all files.

@stweil
Contributor

stweil commented Nov 18, 2019

New results for an ARMv8-based NVIDIA Xavier running Ubuntu Bionic:

# best, configure --disable-openmp --disable-shared
real	0m5.502s
user	0m5.352s
sys	0m0.080s

# best, configure --disable-openmp --disable-shared, SIMD
real	0m3.534s
user	0m3.400s
sys	0m0.080s

Tesseract must be called with -c dotproduct=native to use SIMD.

@stweil
Contributor

stweil commented Nov 18, 2019

The old ARMv7 results were made with tessdata_fast. Here are new results:

# best, configure --disable-openmp --disable-shared
real	0m7.218s
user	0m6.859s
sys	0m0.248s

For this host, SIMD makes no difference. Tesseract uses NEON anyway.

@Shreeshrii
Collaborator

Shreeshrii commented Nov 18, 2019

Can a different dot product calculation be used for AltiVec? Please see slide 16 in https://www.nxp.com/files-static/training_presentation/TP_ALTIVEC.pdf

float FastVectorDotProduct( vector float *v1, vector float *v2, int length ) {
    vector float temp = (vector float) vec_splat_s8(0);
    vector float temp2 = temp;
    vector float temp3 = temp;
    vector float temp4 = temp;
    float result;
    // Loop over the length of the vectors, doing 4 vectors per
    // iteration to fill the pipeline.
    for ( int i = 0; i < length; i += 4 ) {
        temp  = vec_madd( v1[i],   v2[i],   temp );
        temp2 = vec_madd( v1[i+1], v2[i+1], temp2 );
        temp3 = vec_madd( v1[i+2], v2[i+2], temp3 );
        temp4 = vec_madd( v1[i+3], v2[i+3], temp4 );
    }
    // Sum our temp vectors.
    temp = vec_add( temp, temp2 );
    temp3 = vec_add( temp3, temp4 );
    temp = vec_add( temp, temp3 );
    // Add across the vector.
    temp = vec_add( temp, vec_sld( temp, temp, 4 ) );
    temp = vec_add( temp, vec_sld( temp, temp, 8 ) );
    // Copy the result to the stack so we can return it via the FPU.
    vec_ste( temp, 0, &result );
    return result;
}
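For comparison, the same four-accumulator pipelining trick can be sketched in portable scalar C. This is a hypothetical illustration, not Tesseract's actual kernel: the independent partial sums break the floating-point dependency chain, and an auto-vectorizer can map them onto SIMD lanes on any target.

```c
#include <stddef.h>

/* Portable sketch of the four-accumulator dot product (illustrative
 * only, not Tesseract's code). Four independent partial sums keep the
 * FPU pipeline full; a vectorizing compiler can assign each sum to a
 * SIMD lane. */
static double dot_product_unrolled(const double *u, const double *v, size_t n) {
  double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    s0 += u[i] * v[i];
    s1 += u[i + 1] * v[i + 1];
    s2 += u[i + 2] * v[i + 2];
    s3 += u[i + 3] * v[i + 3];
  }
  /* Handle the tail elements that the slide code ignores. */
  for (; i < n; ++i)
    s0 += u[i] * v[i];
  return (s0 + s1) + (s2 + s3);
}
```

Note that, unlike the slide code, this version also handles lengths that are not a multiple of four.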

@amitdo
Collaborator

amitdo commented Nov 18, 2019

It's done automatically with the openmp-simd code.

@amitdo
Collaborator

amitdo commented Nov 18, 2019

@stweil, did you benchmark this against the manual code on an x86-64 machine?

@Shreeshrii
Collaborator

Of course it needs OpenMP (otherwise the compiler will raise an error with the current patch), so -fopenmp is required for that file even if OpenMP was disabled for the rest of the code.

Shouldn't --enable-openmp set -fopenmp?

I added the following to my build script.

export CXXFLAGS="-fopenmp -maltivec -mabi=altivec -mcpu=power8"

Now tesseract --version reports OpenMP; I hadn't seen that before with --enable-openmp builds.

 tesseract -v
tesseract 5.0.0-alpha-554-g9ed3
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.4.4 : libopenjp2 2.3.0
 Found OpenMP 201511

@amitdo
Collaborator

amitdo commented Nov 18, 2019

best, configure --disable-openmp --disable-shared

Of course it needs OpenMP (otherwise the compiler will raise an error with the current patch), so -fopenmp is required for that file even if OpenMP was disabled for the rest of the code.

@stweil, it does not make sense to disable OpenMP and then to enable it again.

@amitdo
Collaborator

amitdo commented Nov 18, 2019

clang and gcc (>=4.9) both support the flag -fopenmp-simd.
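For illustration, here is a minimal sketch of the kind of loop this flag targets (a sketch only, not Tesseract's actual dot-product code). With -fopenmp or -fopenmp-simd the pragma lets the compiler vectorize the reduction; without either flag the pragma is simply ignored and the loop still computes the same result, just scalar.

```c
#include <stddef.h>

/* Illustrative OpenMP-SIMD reduction (not Tesseract's kernel).
 * The pragma tells the compiler the loop iterations may be executed
 * in SIMD lanes, with the partial sums combined into `total`. */
static double dot_product_simd(const double *u, const double *v, size_t n) {
  double total = 0.0;
  #pragma omp simd reduction(+:total)
  for (size_t i = 0; i < n; ++i)
    total += u[i] * v[i];
  return total;
}
```

Unlike plain -fopenmp, -fopenmp-simd honors only the simd pragmas and does not link the OpenMP runtime, so it adds no threading at all.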

@amitdo
Collaborator

amitdo commented Nov 18, 2019

This code is more complete.

74f72e1#diff-d0fa47c1b7e2cb742a89b8c8f824df62R343

@Shreeshrii
Collaborator

Shreeshrii commented Nov 19, 2019

| System | fast, default options | fast, OMP_THREAD_LIMIT=1 |
| --- | --- | --- |
| ripefig - Intel | real 1m9.096s, user 3m7.484s, sys 0m0.335s | real 0m0.366s, user 0m0.346s, sys 0m0.012s |
| stweil - Intel | real 0m0.327s, user 0m0.796s, sys 0m0.024s | |
| stweil - ARMv7 | real 0m1.864s, user 0m4.778s, sys 0m0.147s | real 0m2.078s, user 0m1.950s, sys 0m0.099s |
| stweil - Power8 | real 0m0.802s, user 0m1.815s, sys 0m0.030s | |
| stweil - recent Debian | real 0m0,209s, user 0m0,497s, sys 0m0,024s | real 0m0,255s, user 0m0,247s, sys 0m0,008s |

@Shreeshrii
Collaborator

| System | best, default options | best, OMP_THREAD_LIMIT=1 | best, --disable-openmp --disable-shared | best, --disable-openmp --disable-shared, SIMD |
| --- | --- | --- | --- | --- |
| ARMv7 | | | real 0m7.218s, user 0m6.859s, sys 0m0.248s | no difference |
| ARMv8 | | | real 0m5.502s, user 0m5.352s, sys 0m0.080s | real 0m3.534s, user 0m3.400s, sys 0m0.080s |
| Power8 | real 0m1.329s, user 0m3.804s, sys 0m0.031s | real 0m3.155s, user 0m3.139s, sys 0m0.019s | real 0m2.981s, user 0m2.957s, sys 0m0.024s | real 0m1.686s, user 0m1.669s, sys 0m0.016s |

@Shreeshrii
Collaborator

What is obvious from the timing results is that there is a lot of variation across platforms and across options.

I saw a lot of variation in time even on the same platform - see #2611 (comment)

@Shreeshrii
Collaborator

The training on Power8 should be much faster with SIMD.

@stweil Which parts of the training process will be sped up by this? lstmtraining? I would like to test/benchmark with and without the suggested SIMD test patch.

@stweil
Contributor

stweil commented Nov 24, 2019

Yes, lstmtraining will be faster, both for the training part and for the evaluation. That process uses up to two cores when OpenMP is disabled or up to 8 cores with OpenMP.

But I still do not know how dotproduct=native can be enabled for lstmtraining.

@Shreeshrii
Collaborator

You had also mentioned earlier, in a different thread, about unrolled loops. Should that also be implemented along with this for Power?

#2106 (comment)

@stweil
Contributor

stweil commented Nov 24, 2019

Ideally loop unrolling should also be done by the compiler (try -O3 or -funroll-loops).

@Shreeshrii
Collaborator

I set both -O3 and -ffast-math, similar to

set(MARCH_NATIVE_FLAGS "${MARCH_NATIVE_FLAGS} -O3 -ffast-math")

and the unit test linlsq_test failed with the following error. It passes when I remove -ffast-math.


Running main() from ../../unittest/../googletest/googletest/src/gtest_main.cc
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from LLSQTest
[ RUN      ] LLSQTest.BasicLines
[       OK ] LLSQTest.BasicLines (0 ms)
[ RUN      ] LLSQTest.Vectors
../../unittest/linlsq_test.cc:63: Failure
The difference between correct_vector.y() and vector.y() is 2, which exceeds tolerance, where
correct_vector.y() evaluates to 1,
vector.y() evaluates to -1, and
tolerance evaluates to 9.9999999747524271e-07.
[  FAILED  ] LLSQTest.Vectors (1 ms)
[ RUN      ] LLSQTest.RmsOrthWorksAsIntended
[       OK ] LLSQTest.RmsOrthWorksAsIntended (0 ms)
[----------] 3 tests from LLSQTest (1 ms total)

[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (1 ms total)
[  PASSED  ] 2 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] LLSQTest.Vectors

@Shreeshrii
Collaborator

lstmtraining will be faster, both for the training part and for the evaluation. That process uses up to two cores when OpenMP is disabled or up to 8 cores with OpenMP.

Is using up to 8 cores = 8 threads?

// A collection of DocumentData that knows roughly how much memory it is using.
// Note that while it supports background read-ahead, it assumes that a single
// thread is accessing documents, ie it is not safe for multiple threads to
// access different documents in parallel, as one may de-cache the other's
// content.

@stweil
Contributor

stweil commented Nov 25, 2019

Yes, I should have written "up to 8 threads". If there are only two CPUs with hyperthreading, those 8 threads will run on 4 cores, and the performance will be rather low. Of course you can use OMP_THREAD_LIMIT=4 to handle this, but I am not sure how that will distribute the cores for training and evaluation.

The cache for image data works also with a separate thread, but that does not use OpenMP, so it also works when OpenMP was disabled.

@Shreeshrii
Collaborator

//ie it is not safe for multiple threads to
// access different documents in parallel, as one may de-cache the other's
// content.

I was asking about threads with regard to the comment above: whether multiple threads lead to a slowdown.

@Shreeshrii
Collaborator

Shreeshrii commented Nov 25, 2019

Tesstutorial timing on Power8 (seconds):

| Phase | Master | #pragma omp simd | #pragma omp simd reduction |
| --- | --- | --- | --- |
| 1-makedata (tesstrain.sh) | real 87.50, user 107.15, sys 4.85 | real 67.93, user 107.51, sys 4.88 | real 67.77, user 107.19, sys 5.06 |
| 2-scratch (lstmtraining, lstmeval) | real 7170.90, user 21735.34, sys 124.71; error rate 0.626 | real 7244.33, user 22014.08, sys 128.46; error rate 0.751 | real 7187.00, user 21784.96, sys 124.24; error rate 0.704 |
| 3-impact-from-small (lstmtraining, lstmeval) | real 643.20, user 1893.29, sys 12.16; error rate 0.027 | real 654.52, user 1936.23, sys 14.80; error rate 0.036 | real 641.19, user 1873.26, sys 11.54; error rate 0.059 |
| 4-impact-from-full (lstmtraining, lstmeval) | real 1407.57, user 4727.62, sys 16.75; error rate 0.307 | real 1464.04, user 4887.35, sys 20.09; error rate 0.298 | real 1407.77, user 4738.13, sys 17.04; error rate 0.269 |
| 5-makedata-plusminus (tesstrain.sh) | real 91.26, user 111.95, sys 4.79 | real 71.00, user 124.34, sys 5.82 | real 68.00, user 112.19, sys 4.60 |
| 6-plusminus (lstmtraining, lstmeval) | real 5975.63, user 18346.83, sys 60.97; error rate 0.013 | real 7539.99, user 20610.39, sys 95.87; error rate 0.019 | real 5956.91, user 18285.98, sys 62.21; error rate 0.025 |
| 7-layer (lstmtraining, lstmeval) | real 2808.52, user 8775.01, sys 61.81; error rate 3.946 | real 2793.53, user 8661.41, sys 51.00; error rate 4.012 | real 1865.26, user 5614.33, sys 20.48; error rate 3.886 |

@Shreeshrii
Collaborator

Shreeshrii commented Nov 25, 2019

I have posted above the results of my test on power8 running tesstutorial (using scripts in shreeshrii/tess4training) with tesseract built from git master vs with SIMD patch as suggested by @stweil.

I ran the scripts one by one, without any other process running in the VM so as to get results that should be comparable.

The build was done using the IBM Advance Toolchain rather than the distro's gcc, since:

"AT is highly recommended when you want to build an optimized CPU-bound application on POWER." ref: https://developer.ibm.com/linuxonpower/advance-toolchain/advtool-faq/

PATH=/opt/at12.0/bin:/opt/at12.0/sbin:$PATH gcc --version
gcc (GCC) 8.3.1 20190304 (Advance-Toolchain-at12.0) [revision 269374]

Build options in both cases included the following:

export CXXFLAGS="-O3 -maltivec -mabi=altivec -mcpu=power8 -mtune=power8 -fopenmp"

../../configure --enable-openmp --disable-debug --disable-opencl --disable-graphics --disable-shared --with-tensorflow=no 

@Shreeshrii
Collaborator

Shreeshrii commented Nov 25, 2019

Questions:

  1. Is it OK to set the build options for all programs as I have done above?

  2. I am also planning to test @amitdo's suggestion to use

#pragma omp simd reduction(+:total)

Should I expect the result to be very different from

#pragma omp simd

  3. Is it OK to use the Advance Toolchain?
  PATH=/opt/at12.0/bin:/opt/at12.0/sbin:$PATH gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/opt/at12.0/libexec/gcc/powerpc64le-linux-gnu/8.3.1/lto-wrapper
Target: powerpc64le-linux-gnu
Configured with: /build/at12.0_Ubuntu16_ppc64le-ppc64le/12/at12.0-1.ubuntu-16_ppc64le_ppc64le/sources/gcc/configure --build=powerpc64le-linux-gnu --host=powerpc64le-linux-gnu --target=powerpc64le-linux-gnu --with-cpu=default64 --prefix=/opt/at12.0 --with-long-double-128 --enable-secureplt --disable-multilib --with-advance-toolchain=at12.0 --with-glibc-version=2.28 --with-local-prefix=/opt/at12.0 --enable-threads=posix --enable-languages=c,c++,fortran,go --enable-__cxa_atexit --enable-shared --enable-checking=release --enable-lto --enable-gnu-indirect-function --enable-initfini-array --enable-linker-build-id --with-system-zlib --with-gmp-include=/opt/at12.0/include --with-gmp-lib=/opt/at12.0/lib64 --with-mpfr-include=/opt/at12.0/include --with-mpfr-lib=/opt/at12.0/lib64 --with-mpc-include=/opt/at12.0/include --with-mpc-lib=/opt/at12.0/lib64 --without-ppl --without-cloog --without-libelf --with-host-libstdcxx='-L/opt/at12.0/lib64 -lstdc++ -lsupc++ -lgmp -lgmpxx -lm' --with-cpu=power8 --with-tune=power8
Thread model: posix
gcc version 8.3.1 20190304 (Advance-Toolchain-at12.0) [revision 269374] (GCC)

or should I use the following?

 gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/powerpc64le-linux-gnu/7/lto-wrapper
Target: powerpc64le-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 7.4.0-1ubuntu1~18.04.1' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=powerpc64le-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libquadmath --disable-libquadmath-support --enable-plugin --enable-default-pie --with-system-zlib --enable-objc-gc=auto --enable-secureplt --with-cpu=power8 --enable-targets=powerpcle-linux --disable-multilib --enable-multiarch --disable-werror --with-long-double-128 --enable-checking=release --build=powerpc64le-linux-gnu --host=powerpc64le-linux-gnu --target=powerpc64le-linux-gnu
Thread model: posix
gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)

@Shreeshrii
Collaborator

Updated training test results in #2611 (comment)

@Danx69

Danx69 commented Aug 14, 2021

Tesseract was very slow when run from a script, but I noticed it was very fast in a terminal. Using your suggestions, I modified the script line to "OMP_THREAD_LIMIT=1 xterm -geometry 1X1+0+0 -e tesseract file1 file2" and obtained the same speed.
