Run Tesseract using more than 4 threads ? #1600

Closed
Chien-Hao opened this issue May 24, 2018 · 6 comments

Comments

@Chien-Hao
commented May 24, 2018

Hello,

I am using tesseract-4.0.0-beta.1 on an ARM server with 96 CPU cores/threads.
When I run tesseract, it seems to use only 4 cores of the machine by default.
Is it possible to run tesseract using more than 4 threads?

I have tried modifying the parameter "kNumThreads" in the source code (lstm/fullyconnected.cpp) and num_threads() in ccmain/par_control.cpp, then recompiling tesseract, but the execution time is still almost the same. I have searched many web pages for an answer, but none of them was satisfactory.

Any suggestions or advice are welcome and appreciated.
Thanks a lot.

Platform:
Linux localhost.localdomain 4.14.0-49.el7a.aarch64

@KyleBruene

commented Jun 1, 2018

Short answer: No

As per the FAQ:

Can I increase speed of OCR?
If you are running Tesseract 4, you can use the "fast" integer models.

Tesseract 4 also uses up to four CPU threads while processing a page, so it will be faster than Tesseract 3 for a single page.

If your computer has only two CPU cores, then running four threads will slow down things significantly and it would be better to use a single thread or maybe a maximum of two threads! Using a single thread eliminates the computation overhead of multithreading and is also the best solution for processing lots of images by running one Tesseract process per CPU core.

Set the maximum number of threads using the environment variable OMP_THREAD_LIMIT.

To disable multithreading, use OMP_THREAD_LIMIT=1.
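
As a minimal sketch of the FAQ advice above, the snippet below launches tesseract from Python with OMP_THREAD_LIMIT set in the child process's environment. The helper names (tesseract_env, run_tesseract) are made up for illustration and are not part of Tesseract; only the OMP_THREAD_LIMIT variable and the usual "tesseract <image> <outputbase> -l <lang>" command form are taken as given.

```python
import os
import subprocess


def tesseract_env(thread_limit=1, base_env=None):
    """Return a copy of the environment with OMP_THREAD_LIMIT set.

    thread_limit=1 disables multithreading, as the FAQ suggests.
    """
    if thread_limit < 1:
        raise ValueError("thread_limit must be at least 1")
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = str(thread_limit)
    return env


def run_tesseract(image, output_base, lang="eng", thread_limit=1):
    """Run one tesseract process (hypothetical helper).

    Tesseract appends ".txt" to output_base on its own.
    """
    cmd = ["tesseract", image, output_base, "-l", lang]
    return subprocess.run(cmd, env=tesseract_env(thread_limit), check=True)
```

Calling run_tesseract("image.jpg", "output") would, assuming tesseract is installed and on the PATH, write the recognized text to output.txt using a single thread.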

In your case, I would simply split the document/image into multiple files and use GNU Parallel to run multiple instances of tesseract with multithreading disabled.
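
The same idea can also be sketched in Python rather than with GNU Parallel: one single-threaded tesseract process per image, with a pool that keeps roughly one process per CPU core busy. This is only an illustration under stated assumptions: the function names (build_command, ocr_one, ocr_many) are hypothetical, the document is assumed to be already split into one image file per page, and tesseract is assumed to be on the PATH.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(image, lang="eng"):
    """Build the tesseract command; text is written to <image stem>.txt."""
    path = Path(image)
    output_base = str(path.with_suffix(""))  # tesseract appends ".txt" itself
    return ["tesseract", str(path), output_base, "-l", lang]


def ocr_one(image, lang="eng"):
    """Run a single tesseract process with multithreading disabled."""
    env = dict(os.environ, OMP_THREAD_LIMIT="1")
    result = subprocess.run(build_command(image, lang), env=env)
    return image, result.returncode


def ocr_many(images, workers=None, lang="eng"):
    """Run one tesseract process per image, at most `workers` at a time."""
    workers = workers or os.cpu_count() or 1
    # Threads are enough here: each one only waits on a child process,
    # so the actual OCR work runs in separate tesseract processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda img: ocr_one(img, lang), images))
```

Passing a list of page images to ocr_many returns (image, return code) pairs in input order, so failed pages can be retried separately.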

@Chien-Hao

Author

commented Jun 4, 2018

Okay, thanks a lot for your suggestion.

@Chien-Hao closed this Jun 4, 2018
@ghost

commented Jun 5, 2018

@KyleBruene Good suggestion, but when using parallel, will the command in the terminal be the same?
parallel tesseract image.jpg output.txt -l eng

@Chien-Hao

Author

commented Jun 6, 2018

@christophered Thanks a lot for the information. I will try it if necessary.

@sirius0503

commented Sep 18, 2019

@KyleBruene What is the difference between GNU Parallel and Python's multiprocessing module, and which might be faster?

@DiegoPino referenced this issue Oct 8, 2019