
RFC: Best Practices re OPENMP - for training, evaluation and recognition #3744

Shreeshrii opened this issue Feb 6, 2022 · 21 comments

@Shreeshrii (Collaborator):
For Tesseract 5, what are the best practices regarding OPENMP?

Is the following still true?

  1. OPENMP is needed for training, so build tesseract and the training tools with --enable-openmp.
  2. For lstmeval (built with --enable-openmp), use OMP_THREAD_LIMIT=1.
  3. For recognition with tesseract (built with --enable-openmp), use OMP_THREAD_LIMIT=1.
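
For reference, items 2 and 3 correspond to invocations like the following (a minimal sketch; the traineddata name, image.png and out are placeholder names, not from this thread):

# Item 2: evaluation with an --enable-openmp build, limited to one thread
OMP_THREAD_LIMIT=1 lstmeval --model eng.traineddata --eval_listfile list.eval
# Item 3: recognition with an --enable-openmp build, limited to one thread
OMP_THREAD_LIMIT=1 tesseract image.png out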
Shreeshrii changed the title from "Best Practices re OPENMP - for training, evaluation and recognition" to "RFC: Best Practices re OPENMP - for training, evaluation and recognition" on Feb 6, 2022.
@stweil (Contributor) commented Feb 6, 2022:

OPENMP is not needed for training. It even makes things worse for me. Timing results for lstm_squashed_test on AMD EPYC 7502 show that no OPENMP (--disable-openmp) is best, followed by OPENMP compiled in but disabled at runtime (OMP_THREAD_LIMIT=1). Enabled OPENMP comes last and burns a lot of CPU performance for nothing:

# --disable-openmp
real 28.41
user 28.33
sys 0.08
# --enable-openmp
real 33.16
user 129.41
sys 1.46
# --enable-openmp, OMP_THREAD_LIMIT=1
real 32.89
user 32.61
sys 0.28
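
These figures can be reproduced with something like the following sketch, assuming an autotools build tree in which the unit tests have been built (e.g. via make check); exact paths and targets may differ:

# Build variant 1: no OpenMP compiled in
./configure --disable-openmp && make check
time -p ./unittest/lstm_squashed_test

# Build variant 2: OpenMP compiled in, measured with and without a runtime limit
./configure --enable-openmp && make check
time -p ./unittest/lstm_squashed_test
export OMP_THREAD_LIMIT=1
time -p ./unittest/lstm_squashed_test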

@amitdo (Collaborator) commented Feb 6, 2022:

The plan is to disable it by default in 5.1.0.

@stweil (Contributor) commented Feb 6, 2022:

... in autoconf builds. cmake already disables it by default.

@stweil (Contributor) commented Feb 6, 2022:

Note that even without OPENMP, training uses up to two CPU threads: one for training, which runs until training is finished, and one for evaluation, which runs from time to time during the training process.
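
One way to verify this while a training run is active (a sketch, assuming a Linux system with procps; lstmtraining is Tesseract's LSTM training tool):

# Print the thread count (nlwp) of the newest lstmtraining process
ps -o nlwp= -p "$(pgrep -n lstmtraining)"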

@amitdo (Collaborator) commented Feb 6, 2022:

The reason for disabling OpenMP is that Tesseract currently uses it inefficiently.

For text recognition, the speed benefit of using OpenMP with fast (best->int) traineddata is too small, while it consumes too many CPU resources.

For training, the OpenMP code is even more problematic than the code used for text recognition. I'm not sure how much speed will be lost here.

@Shreeshrii (Collaborator, Author) commented:

Thank you!

> no OPENMP is best, followed by disabled OPENMP

Does no OPENMP mean building with --disable-openmp as part of autotools configure?

@stweil (Contributor) commented Feb 6, 2022:

Yes, currently it is necessary to use configure --disable-openmp. As Amit has written above, that should become the default, but I still have no simple code to achieve that.

I updated my comment to be clearer.
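
For completeness, the full autotools sequence for an OpenMP-free build looks like this (a sketch of the standard Tesseract build steps):

./autogen.sh
./configure --disable-openmp
make
sudo make install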

@amitdo (Collaborator) commented Feb 6, 2022:

--disable-openmp disables OpenMP at compile time, while OMP_THREAD_LIMIT=1 disables it at runtime. The first method is more efficient, while the second method is more flexible.
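
One way to tell which kind of binary you have (an assumption on my part: builds with OpenMP compiled in report a "Found OpenMP ..." line in the version output, alongside the library list shown later in this thread):

# No output here suggests an OpenMP-free build, so OMP_THREAD_LIMIT has no effect
tesseract --version | grep -i openmp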

@amitdo (Collaborator) commented Feb 6, 2022:

Stefan, for 5.1.0, do you want to keep a way to enable OpenMP with --enable-openmp?

@Shreeshrii (Collaborator, Author) commented Feb 6, 2022:

> OPENMP is not needed for training. It even makes things worse for me. Timing results for lstm_squashed_test on AMD EPYC 7502 show that no OPENMP (--disable-openmp) is best, followed by OPENMP compiled in but disabled at runtime (OMP_THREAD_LIMIT=1). Enabled OPENMP comes last and burns a lot of CPU performance for nothing:

Ok. I will try the training-from-fonts scenarios in my tess5train-fonts repo to see if they give similar results.

@Shreeshrii (Collaborator, Author) commented Feb 6, 2022:

lstmeval

Which time figures (real, user, sys) are important? Which scenario is preferable?

no OPENMP (--disable-openmp)

tesseract 5.0.1-19-g44ddde
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found NEON
 Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1
 Found libcurl/7.58.0 NSS/3.35 zlib/1.2.11 libidn2/2.0.4 libpsl/0.19.1 (+libidn2/2.0.4) nghttp2/1.30.0 librtmp/2.3
time -p lstmeval  \
	--verbosity=0 \
	--model data/engFineTuned/tessdata_fast/engFineTuned_0.631000_121_600.traineddata \
	--eval_listfile data/engFineTuned/list.eval 2>&1 | grep "^BCER eval" > data/engFineTuned/tessdata_fast/engFineTuned_0.631000_121_600.eval.log
real 805.37
user 805.34
sys 0.03
time -p lstmeval  \
	--verbosity=0 \
	--model data/engFineTuned/tessdata_fast/engFineTuned_0.028000_156_2000.traineddata \
	--eval_listfile data/engFineTuned/list.eval 2>&1 | grep "^BCER eval" > data/engFineTuned/tessdata_fast/engFineTuned_0.028000_156_2000.eval.log
real 806.56
user 806.49
sys 0.07
time -p lstmeval  \
	--verbosity=0 \
	--model data/engFineTuned/tessdata_fast/engFineTuned_0.558000_125_700.traineddata \
	--eval_listfile data/engFineTuned/list.eval 2>&1 | grep "^BCER eval" > data/engFineTuned/tessdata_fast/engFineTuned_0.558000_125_700.eval.log
real 806.10
user 806.04
sys 0.07

Enabled OPENMP

an older tesseract 5.0.1 build with --enable-openmp
time -p lstmeval  \
	--verbosity=0 \
	--model data/engFineTuned/tessdata_fast/engFineTuned_0.645000_119_600.traineddata \
	--eval_listfile data/engFineTuned/list.eval 2>&1 | grep "^BCER eval" > data/engFineTuned/tessdata_fast/engFineTuned_0.645000_119_600.eval.log
real 331.53
user 1041.90
sys 9.02
time -p lstmeval  \
	--verbosity=0 \
	--model data/engFineTuned/tessdata_fast/engFineTuned_0.119000_156_1500.traineddata \
	--eval_listfile data/engFineTuned/list.eval 2>&1 | grep "^BCER eval" > data/engFineTuned/tessdata_fast/engFineTuned_0.119000_156_1500.eval.log
real 331.30
user 1042.38
sys 8.55
time -p lstmeval  \
	--verbosity=0 \
	--model data/engFineTuned/tessdata_fast/engFineTuned_0.014000_165_2500.traineddata \
	--eval_listfile data/engFineTuned/list.eval 2>&1 | grep "^BCER eval" > data/engFineTuned/tessdata_fast/engFineTuned_0.014000_165_2500.eval.log
real 331.70
user 1042.77
sys 8.97

@Shreeshrii (Collaborator, Author) commented:

lstmeval - engImpact

No OPENMP

time -p lstmeval  \
	--verbosity=0 \
	--model data/engImpact/tessdata_fast/engImpact_0.489000_152_900.traineddata \
	--eval_listfile data/engImpact/list.eval 2>&1 | grep "^BCER eval" > data/engImpact/tessdata_fast/engImpact_0.489000_152_900.eval.log
real 19.85
user 19.82
sys 0.04

OPENMP

time -p lstmeval  \
	--verbosity=0 \
	--model data/engImpact/tessdata_fast/engImpact_0.489000_152_900.traineddata \
	--eval_listfile data/engImpact/list.eval 2>&1 | grep "^BCER eval" > data/engImpact/tessdata_fast/engImpact_0.489000_152_900.eval.log
real 8.25
user 25.87
sys 0.27

@stweil (Contributor) commented Feb 6, 2022:

> Which time figures (real, user, sys) are important? Which scenario is preferable?

"real" is the time spent from program start to termination.
"user" and "sys" is the accumulated time used by all CPUs in user space / system space.
For single threaded applications like Tesseract without OPENMP "real" is normally equal to the sum of "user" and "sys". "real" can also be much larger if the execution is delayed, for example by other applications running simultaneously.

In your test scenario lstmeval was much faster with OPENMP enabled ("real" is 331 s instead of 805 s), so you'd prefer that to get a result fast. The CPU resources where slightly more with OPENMP ("user" 1042 s and "sys" 9 s instead of 805 s / 0.05 s), so the faster execution costs some (acceptable) overhead in this case.
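
In numbers: the wall-clock speedup from OPENMP here is 805 s / 331 s ≈ 2.4×, while total CPU time grows from about 805 s to 1042 s + 9 s ≈ 1051 s, a factor of roughly 1.3. So about 30 % more CPU work buys a roughly 2.4× faster result on this machine.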

> for 5.1.0, do you want to keep a way to enable OpenMP with --enable-openmp?

Yes, I think that's necessary for compatibility, and also because it can be useful, as in @Shreeshrii's test case on ARM.

@stweil (Contributor) commented Feb 6, 2022:

Running Tesseract with several threads seems to work better on ARM than on Intel architectures. I noticed that with Apple M1 (AARCH64), too.

@Shreeshrii (Collaborator, Author) commented:

> Running Tesseract with several threads seems to work better on ARM than on Intel architectures. I noticed that with Apple M1 (AARCH64), too.

I am running this on AARCH64.

@zdenop (Contributor) commented Feb 6, 2022:

Also, my tests show that enabled OPENMP can make sense in some cases (e.g. for the best data model on Windows with MSVC 2019 and an Intel processor). It would be great if we found somebody familiar with OpenMP, at least to review how tesseract uses it...

@tdhintz (Contributor) commented Feb 16, 2022:

My timings for OpenMP on Windows MSVC are at the end of issue #3044.

@Shreeshrii (Collaborator, Author) commented:

Thanks, @tdhintz

It would be good to know whether those results still hold. If possible, please rerun the tests with the released tesseract 5 version or the latest GitHub version, since there have been many changes since 2020.

@tdhintz (Contributor) commented Feb 23, 2022:

@Shreeshrii I'll add that task to our plan for late March. We build with very specific settings to get the best results, and I'm sure the build process has changed again, so this will be a heavy lift.

@tdhintz (Contributor) commented Mar 25, 2022:

Looks like someone did this already: OpenMP benchmark

@Shreeshrii (Collaborator, Author) commented:

> Looks like someone did this already: OpenMP benchmark

That test by @zdenop uses one image 15 times. Your tests use many more combinations.

> We ran a comparison between a pre-release of 4.0 and the current 5.0 on AVX2 and SSE hardware on Windows that I'll share just for grins. The 4.0 was built with floating point set to fast, COMDAT folding, OpenMP and was PGO optimized. The 5.0 build also used floating point 'fast' and COMDAT folding, but without OpenMP and without PGO optimization.
> 2,880 combinations of settings and images were tested for each AVX2 and SSE platform. The tests are by no means comprehensive of all possible combinations. For example, only Eng traindata was used, although the Fast, Best and Blended data were all used.

> this will be a heavy lift.

I understand.
If possible, the results could be added to tessdoc for easy reference. Thanks.
