Is there any way of speeding things up? #293

Closed
senisioi opened this issue Jan 6, 2022 · 6 comments

senisioi commented Jan 6, 2022

This is not really an issue, just a question about ways of speeding things up.
I have been doing some benchmarking with Tesseract 5.0.0. For a 20-page PDF document of scanned images, it takes around 103 seconds (depending on the type of processor) to run the CLI wrapper pytesseract, whereas preloading the API with tesserocr gives only a 0.6-second speed-up. Here is a link to a Colab notebook.

Is there any faster way of getting the texts, other than this:

from tesserocr import PyTessBaseAPI

with PyTessBaseAPI(lang=lang) as api:
    text = ''.join([api.GetUTF8Text() for img in images if api.SetImage(img) is None])

zdenop commented Jan 25, 2022

You can speed up the Tesseract OCR process by setting the parameter tessedit_do_invert to 0 (provided you can ensure that the input is dark text on a light background).

It is also worth checking whether Tesseract is built with OpenMP support (if so, set the environment variable OMP_THREAD_LIMIT=1). Search the Tesseract issue tracker for more details about these parameters.
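As an illustration of the two suggestions above, here is a minimal sketch of how they might be applied from tesserocr. The helper names `disable_inversion` and `ocr_images` are hypothetical and not part of any library, and the sketch assumes that the environment variable has to be set before the Tesseract library is loaded. Whether either setting helps depends on your build and your images, so measure before relying on it.

```python
import os

# Assumption: the limit is set before the Tesseract library is loaded,
# so that an OpenMP-enabled build picks it up.
os.environ["OMP_THREAD_LIMIT"] = "1"


def disable_inversion(api):
    """Set tessedit_do_invert to 0 on an initialised API object.

    Only appropriate when the input is dark text on a light background.
    Returns True if Tesseract accepted the parameter.
    """
    return api.SetVariable("tessedit_do_invert", "0")


def ocr_images(images, lang="eng"):
    """OCR a list of PIL images with one preloaded API instance (hypothetical helper)."""
    from tesserocr import PyTessBaseAPI  # imported after the env var is set

    texts = []
    with PyTessBaseAPI(lang=lang) as api:
        disable_inversion(api)
        for img in images:
            api.SetImage(img)
            texts.append(api.GetUTF8Text())
    return "".join(texts)
```

With pytesseract, the same parameter can usually be passed on the command line through the `config` argument, for example `config="-c tessedit_do_invert=0"`.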


BTW: I would expect pytesseract to be slower than tesserocr, as pytesseract performs more I/O operations. Since your numbers are nearly identical, there are two possible explanations:

  1. I/O operations are no longer a significant cost nowadays
  2. There is room for performance improvement in tesserocr

logan-markewich commented Nov 24, 2022

I've spent the day testing, and tesserocr seems to be slower than pytesseract. I need the boxes, so I'm comparing to image_to_data from pytesseract

Here's my quick benchmark script (the test image is a 3300x2800 invoice image).

PyTesseract takes 2.6s, tesserocr takes 2.8s

import time

import pytesseract
from PIL import Image
from tesserocr import PyTessBaseAPI, RIL

texts = []
image = Image.open('./test.png')

# Margin added around each detected box. Integer division keeps the
# SetRectangle arguments integral.
delta = image.size[0] // 200

with PyTessBaseAPI() as api:
    start = time.time()
    api.SetImage(image)
    boxes = api.GetComponentImages(RIL.TEXTLINE, True)  # or other values from RIL, depending on your image
    print('Found {} image components.'.format(len(boxes)))
    for i, (im, box, _, _) in enumerate(boxes):
        # im is a PIL image object
        # box is a dict with x, y, w and h keys
        # widening the box with delta can greatly improve the text output
        api.SetRectangle(box['x'] - delta, box['y'] - delta, box['w'] + 2 * delta, box['h'] + 2 * delta)
        ocrResult = api.GetUTF8Text()
        texts.append(ocrResult)
    end = time.time()

print('Tesserocr (with boxes) took {}'.format(end - start))

start = time.time()
res = pytesseract.image_to_data(image)
end = time.time()

print('Pytesseract took {}'.format(end - start))

zdenop commented Nov 25, 2022

BTW: If you are serious about measuring speed, you should do it correctly:

  • do not run tesserocr and pytesseract in the same script
  • run the test multiple times and use the average time (the first run is usually the worst)
  • compare like with like (e.g. image_to_data gets tabular data from tesseract in a single call, while your script asks tesserocr in a loop to OCR the image and process each line)

Have a look at the tesseract Benchmarks page for inspiration.
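The points about repeated runs and averaging can be sketched as a small timing helper. This is only an illustration: the function name `benchmark` and its `repeats` and `warmup` parameters are hypothetical, and the helper was not used for any of the numbers in this thread.

```python
import statistics
import time


def benchmark(func, repeats=5, warmup=1):
    """Time func() several times and summarise the runs in seconds."""
    for _ in range(warmup):
        func()  # warm-up runs are executed but not recorded
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(timings),
        "median": statistics.median(timings),
        "min": min(timings),
        "max": max(timings),
        "runs": timings,
    }
```

To follow the first point as well, each library would get its own script, for example one that calls `benchmark(lambda: pytesseract.image_to_data(image))` and a separate one that wraps the tesserocr code path.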

You did not provide the image, so I used my own and got totally different results with your script (a 35% difference in favor of tesserocr):

Found 39 image components.
Tesserocr (with boxes) took 1.842365026473999
Pytesseract took 2.490272045135498

So I used another image (from tesseract issue 263), and the results were different again:

Found 50 image components.
Tesserocr (with boxes) took 6.5823915004730225
Pytesseract took 2.264941453933716

=> The problem is in your testing script, not in tesserocr or pytesseract (the only difference is the number of extra steps that pytesseract needs to perform, which should not be significant on modern hardware).

@logan-markewich

Interesting findings, thanks for the follow-up! 💪🏻

sirfz closed this as completed Jul 19, 2023
@mali-tintash

Found 50 image components.
Tesserocr (with boxes) took 6.5823915004730225
Pytesseract took 2.264941453933716

Is it reversed, or did tesserocr really perform 3 times worse than pytesseract?

zdenop commented Aug 29, 2023

AFAIR it is not reversed. It is just an example showing that if you implement the logic with tesserocr incorrectly on a complex image, you get misleading results.
There is no reason why pytesseract should be faster than tesserocr (if you use the same version of the tesseract library).
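For comparison with image_to_data, one way to get text and boxes from tesserocr without re-running OCR per line is to recognize the whole image once and then walk the result iterator. The sketch below is an assumption about what a more comparable implementation could look like. It was not posted or benchmarked in this thread, and the helper names `recognized_lines` and `ocr_with_boxes` are hypothetical.

```python
def recognized_lines(api, level):
    """Collect (text, bounding_box) pairs after one recognition pass.

    `api` is expected to be a tesserocr PyTessBaseAPI with an image set;
    `level` is a tesserocr RIL value such as RIL.TEXTLINE.
    """
    api.Recognize()  # run OCR once for the whole image
    iterator = api.GetIterator()
    results = []
    if iterator is None:
        return results
    while True:
        box = iterator.BoundingBox(level)  # (left, top, right, bottom) or None
        if box is not None:
            results.append((iterator.GetUTF8Text(level), box))
        if not iterator.Next(level):
            break
    return results


def ocr_with_boxes(image_path, lang="eng"):
    """Hypothetical wrapper: text lines with boxes for one image file."""
    from tesserocr import PyTessBaseAPI, RIL  # imported lazily

    with PyTessBaseAPI(lang=lang) as api:
        api.SetImageFile(image_path)
        return recognized_lines(api, RIL.TEXTLINE)
```

Unlike the loop in the benchmark script above, this variant does not widen the boxes and calls recognition only once, so its output and timing are not directly interchangeable with that script.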
