Is there any way of speeding things up? #293

senisioi · 2022-01-06T10:57:43Z

This is not really an issue, just a question on ways of speeding things up.
I have been doing some benchmarking with tesseract 5.0.0. For a 20-page PDF document with scanned images, running the CLI wrapper pytesseract takes around 103 seconds (depending on the type of processor), whereas preloading the API gives a speed-up of only 0.6 seconds. Here is a link to a Colab notebook.

Is there any faster way of getting the texts, other than this:

with PyTessBaseAPI(lang=lang) as api:    
    text = ''.join([api.GetUTF8Text() for img in images if api.SetImage(img) is None])

zdenop · 2022-01-25T12:12:02Z

You can speed up the tesseract OCR process by setting the parameter tessedit_do_invert to 0 (provided you can ensure the input is dark text on a light background).

It is also worth checking whether tesseract is built with OpenMP support (if so, set the environment variable OMP_THREAD_LIMIT=1). Search the tesseract issue tracker for more details about these parameters.
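The two settings above could be applied from tesserocr roughly as follows. This is a minimal sketch, not a verified recipe: `configure_for_speed` and `ocr_pages` are hypothetical helper names, and setting the environment variable before tesserocr is imported is an assumption about when OpenMP reads it.

```python
import os

# Assumption: the limit should be in place before the tesseract library starts
# its OpenMP threads, so it is set before tesserocr is imported.
os.environ["OMP_THREAD_LIMIT"] = "1"


def configure_for_speed(api):
    """Turn off the inverted-text pass on an initialized PyTessBaseAPI-like object."""
    # SetVariable returns False when the parameter name is not accepted.
    if not api.SetVariable("tessedit_do_invert", "0"):
        raise RuntimeError("could not set tessedit_do_invert")
    return api


def ocr_pages(images, lang="eng"):
    """OCR a list of PIL images with one preloaded API instance (hypothetical helper)."""
    from tesserocr import PyTessBaseAPI  # imported lazily, after the env var is set

    texts = []
    with PyTessBaseAPI(lang=lang) as api:
        configure_for_speed(api)
        for img in images:
            api.SetImage(img)
            texts.append(api.GetUTF8Text())
    return texts
```

Disabling the inversion pass is only safe when the scans really are dark text on a light background, as noted above.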

BTW: I would expect pytesseract to be slower than tesserocr, as pytesseract performs more I/O operations:

it needs to load/initialize the language model on each run
it saves the input file to storage
the OCR result is written to a file, so it has to be read back from storage

So there are two possible explanations:

I/O operations are not significant nowadays
There is room for performance improvement in tesserocr
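To make the three I/O steps listed above concrete, here is a simplified illustration of the work a CLI wrapper has to do on every call. It is not pytesseract's actual source: `ocr_via_cli` and its `run` parameter are hypothetical, and only the basic `tesseract <input> <outputbase> -l <lang>` command form is assumed.

```python
import os
import subprocess
import tempfile


def ocr_via_cli(image, lang="eng", run=subprocess.run):
    """Roughly what a CLI wrapper does per call (illustration only)."""
    with tempfile.TemporaryDirectory() as tmp:
        input_path = os.path.join(tmp, "input.png")
        output_base = os.path.join(tmp, "output")
        # 1. The input image is saved to storage.
        image.save(input_path)
        # 2. A new process is started, so the language model is loaded again.
        run(["tesseract", input_path, output_base, "-l", lang], check=True)
        # 3. tesseract writes <output_base>.txt, which has to be read back.
        with open(output_base + ".txt", encoding="utf-8") as fh:
            return fh.read()
```

With a preloaded API object, steps 1 and 3 disappear and the model is loaded once, which is why a larger gap between the two libraries would normally be expected.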

logan-markewich · 2022-11-24T21:16:49Z

I've spent the day testing, and tesserocr seems to be slower than pytesseract. I need the boxes, so I'm comparing against image_to_data from pytesseract.

Here's my quick benchmark script (the test image is a 3300x2800 invoice image).

pytesseract takes 2.6 s, tesserocr takes 2.8 s.

import time

from PIL import Image
from tesserocr import PyTessBaseAPI, RIL

texts = []
image = Image.open('./test.png')

with PyTessBaseAPI() as api:
    start = time.time()
    api.SetImage(image)
    boxes = api.GetComponentImages(RIL.TEXTLINE, True) # or others values from RIL, depends on your image
    print('Found {} image components.'.format(len(boxes)))
    for i, (im, box, _, _) in enumerate(boxes):
        # im is a PIL image object
        # box is a dict with x, y, w and h keys
        # widening the box by delta on every side can greatly improve the text output
        delta = image.size[0] // 200  # integer padding, since SetRectangle takes int coordinates
        api.SetRectangle(box['x'] - delta, box['y'] - delta, box['w'] + 2 * delta, box['h'] + 2 * delta)
        ocrResult = api.GetUTF8Text()
        texts.append(ocrResult)
    end = time.time()

print('Tesserocr (with boxes) took {}'.format(end-start))

import pytesseract

start = time.time()
res = pytesseract.image_to_data(image)
end = time.time()

print('Pytesseract took {}'.format(end-start))

zdenop · 2022-11-25T12:43:48Z

BTW: If you are serious about measuring speed, you should do it properly:

do not run tesserocr and pytesseract in the same script
run the test multiple times and use the average time (the first run is usually the slowest)
compare like with like (e.g. image_to_data gets the tabular data from tesseract in one call, while your script asks tesserocr in a loop to OCR the image and process each line)

Have a look at the tesseract Benchmarks page for inspiration.
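The "run it multiple times and average" advice could look like the sketch below. `benchmark` is a hypothetical helper, and the defaults of one warm-up call and five timed runs are arbitrary assumptions, not recommended values.

```python
import statistics
import time


def benchmark(fn, repeats=5, warmup=1):
    """Call fn several times and summarize the wall-clock timings."""
    # Warm-up calls are not timed, since the first run is usually the slowest.
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(timings),
        "min": min(timings),
        "max": max(timings),
        "runs": timings,
    }
```

Following the first point in the list, each library would get its own script that passes its own OCR function to `benchmark`, so the two measurements do not share a process.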

You did not provide the image, so I used my own and got totally different results with your script (a 35% difference in favor of tesserocr):

Found 39 image components.
Tesserocr (with boxes) took 1.842365026473999
Pytesseract took 2.490272045135498

Then I used another image (from tesseract issue 263) and the results are different:

Found 50 image components.
Tesserocr (with boxes) took 6.5823915004730225
Pytesseract took 2.264941453933716

=> the problem is in your testing script, not in tesserocr / pytesseract (the only difference is the number of extra steps that pytesseract needs to do, but on modern hardware that should not be significant).

logan-markewich · 2022-11-25T15:27:38Z

Interesting findings, thanks for the follow-up! 💪🏻

mali-tintash · 2023-08-28T11:13:04Z

Found 50 image components.
Tesserocr (with boxes) took 6.5823915004730225
Pytesseract took 2.264941453933716

Is it reversed, or did tesserocr really perform 3 times worse than pytesseract?

zdenop · 2023-08-29T20:38:09Z

AFAIR it is not reversed. It is just an example: if you implement the logic with tesserocr the wrong way on a complex image, you get the wrong result.
There is no reason why pytesseract should be faster than tesserocr (if you use the same version of the tesseract library).
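One possible reading of the "compare comparable" point is to recognize the page once and walk the result iterator, instead of calling SetRectangle and GetUTF8Text for every line. The sketch below is an assumption about what a fairer tesserocr counterpart to image_to_data might look like, not something confirmed in this thread; `lines_from_iterator` and `ocr_lines` are hypothetical helper names.

```python
def lines_from_iterator(results, level):
    """Collect (text, bounding box) pairs from result-iterator items."""
    lines = []
    for r in results:
        text = r.GetUTF8Text(level)
        box = r.BoundingBox(level)  # (left, top, right, bottom) or None
        lines.append((text.strip(), box))
    return lines


def ocr_lines(image, lang="eng"):
    """Run recognition once on the whole image and return text lines with boxes."""
    from tesserocr import PyTessBaseAPI, RIL, iterate_level  # lazy import

    with PyTessBaseAPI(lang=lang) as api:
        api.SetImage(image)
        api.Recognize()  # a single recognition pass over the full page
        results = iterate_level(api.GetIterator(), RIL.TEXTLINE)
        return lines_from_iterator(results, RIL.TEXTLINE)
```

Whether this closes the gap on a given image would still need to be measured; elements without text may need extra handling.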

sirfz closed this as completed Jul 19, 2023
