Is there any way of speeding things up? #293
Comments
You can speed up the tesseract OCR process by setting a parameter. It could also be good to check whether tesseract is built with OpenMP support (if yes, then set an environment variable). BTW: I would expect pytesseract to be slower than tesserocr, as pytesseract has more IO operations:
So there are two possible explanations:
I've spent the day testing, and tesserocr seems to be slower than pytesseract. I need the boxes, so I'm comparing against image_to_data from pytesseract. Here's my quick benchmark script (the test image is a 3300x2800 invoice image). PyTesseract takes 2.6s, tesserocr takes 2.8s.
BTW: If you are serious about measuring speed, you should do it correctly:
Have a look at the tesseract Benchmarks page for inspiration. You did not provide an image, so I used my own and got totally different results with your script (35% difference for tesserocr):
So I used another image (from tesseract issue 263) and the results are different:
=> The problem is in your testing script, not in Tesserocr / Pytesseract (the difference is only in the number of extra steps that Pytesseract needs to do, but on modern hardware it should not be significant).
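To illustrate what a more careful measurement could look like, here is a minimal sketch of a timing harness. It is not the script from this thread: the names `benchmark` and `summarize` are hypothetical, and the choice of one warm-up call and five timed repeats is an assumption. The idea is simply to discard the first call (which may include model loading and cold caches) and to report a spread rather than a single number.

```python
import statistics
import time


def benchmark(func, *args, warmup=1, repeats=5):
    """Return a list of per-call timings (in seconds) for func(*args).

    Warm-up calls run first and are not timed, so one-off start-up
    costs do not distort the comparison.
    """
    for _ in range(warmup):
        func(*args)

    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Reduce raw timings to min / median / max for easier comparison."""
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }
```

Each library's call (for example `pytesseract.image_to_data` and the equivalent tesserocr routine) would be passed in as `func` with the same image, and the medians compared across several different images rather than one.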
Interesting findings, thanks for the follow-up! 💪🏻
Is it reversed, or did TesserOCR really perform 3 times worse than pytesseract?
AFAIR it is not reversed. It is just an example: if you implement the logic incorrectly with tesserocr on a complex image, you get the wrong result.
This is not really an issue, just a question on ways of speeding things up.
I have been doing some benchmarking with tesseract 5.0.0. For a 20-page PDF document of scanned images, running the CLI wrapper pytesseract takes around 103 seconds (depending on the type of processor), whereas preloading the API gives a speed-up of 0.6 seconds. Here is a link to a colab notebook.
Is there any faster way of getting the texts, other than this:
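For context, "preloading the API" with tesserocr usually means creating one `PyTessBaseAPI` instance and reusing it for every page, so the language model is loaded only once. The following is a minimal sketch of that pattern and not the poster's original code: the helper names `ocr_pages` and `ocr_pages_with_tesserocr` are hypothetical, and the pages are assumed to already exist as image files on disk.

```python
def ocr_pages(api, image_paths):
    """OCR every image using a single, already initialised API object."""
    texts = []
    for path in image_paths:
        api.SetImageFile(path)           # point the engine at the next page
        texts.append(api.GetUTF8Text())  # run recognition and collect the text
    return texts


def ocr_pages_with_tesserocr(image_paths, lang="eng"):
    """Create the tesserocr API once and reuse it for all pages."""
    # Imported lazily so ocr_pages can be used without tesserocr installed.
    from tesserocr import PyTessBaseAPI

    with PyTessBaseAPI(lang=lang) as api:
        return ocr_pages(api, image_paths)
```

Keeping the loop separate from the API construction also makes it easy to time the recognition step on its own, apart from model loading.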