Extracting words in document using threading #1151

nachonavarro · 2021-07-16T19:10:37Z


Hi!

Is there any way to parallelize the following code that extracts all the words in a PDF:

import fitz  # PyMuPDF

doc = fitz.open("hello.pdf")
words = []
for page in doc:
  # Do some other stuff...
  words.extend(page.get_text_words())

I tried using the concurrent futures library:

from concurrent.futures import ThreadPoolExecutor

doc = fitz.open("hello.pdf")
with ThreadPoolExecutor() as executor:
  words = executor.map(lambda page: page.get_text_words(), doc)

but it crashes for some docs so I'm guessing this is not thread-safe?
Why does it crash for some documents but not for others?

Thanks!

Answered by JorjMcKie

JorjMcKie · 2021-07-16T21:25:28Z

Maintainer

No. As mentioned in the documentation, PyMuPDF does not support Python threading. Use multiprocessing instead. There are example scripts in the documentation.
Why it crashes for some documents but not for others is hard to tell without knowing anything about the environment.

1 reply

JorjMcKie Jul 17, 2021
Maintainer

So you must use ProcessPoolExecutor instead of ThreadPoolExecutor.
Note that fitz.Document and other PyMuPDF objects are not picklable, which takes some thought to work around. Each worker process must therefore create its own document: a "centrally defined" document object is not accessible to the worker and cannot be passed to it. Pass the filename instead. Passing the document as a binary stream is also possible, but not advisable because of the pickling size.
This of course causes some overhead, which should be minimized by letting each worker handle a chunk of pages instead of a single page. It is best to subdivide the document's pages into n pagechunks, where n is the number of available (or to-be-used) processors.
So the executor would be called via executor.map(worker, pagechunks).
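
A rough sketch of that approach, not taken from the PyMuPDF example scripts. The names make_chunks, worker and extract_words are made up for illustration, and each task carries the filename along with a (start, stop) page range so that the worker can open its own document. The fitz import sits inside the functions only to keep the chunking helper usable without PyMuPDF installed; a module-level import should work just as well.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def make_chunks(page_count, n):
    """Split range(page_count) into at most n contiguous (start, stop) ranges."""
    if page_count <= 0:
        return []
    n = max(1, min(n, page_count))
    size, extra = divmod(page_count, n)
    chunks = []
    start = 0
    for i in range(n):
        # The first `extra` chunks get one additional page each.
        stop = start + size + (1 if i < extra else 0)
        chunks.append((start, stop))
        start = stop
    return chunks


def worker(task):
    """Open the document inside the worker process and extract one page range."""
    import fitz  # PyMuPDF

    filename, start, stop = task  # only picklable values cross the process boundary
    doc = fitz.open(filename)
    words = []
    for pno in range(start, stop):
        words.extend(doc[pno].get_text("words"))
    doc.close()
    return words  # a list of plain tuples, which pickles fine


def extract_words(filename, n_procs=None):
    """Hypothetical driver: one page chunk per process, results in page order."""
    import fitz  # PyMuPDF

    n_procs = n_procs or os.cpu_count() or 1
    doc = fitz.open(filename)
    page_count = doc.page_count
    doc.close()

    pagechunks = [(filename, a, b) for a, b in make_chunks(page_count, n_procs)]
    words = []
    with ProcessPoolExecutor(max_workers=n_procs) as executor:
        # executor.map yields results in the order the chunks were submitted.
        for chunk_words in executor.map(worker, pagechunks):
            words.extend(chunk_words)
    return words
```

Call extract_words("hello.pdf") from under an if __name__ == "__main__": guard, since on platforms that spawn worker processes the main module is imported again in each worker.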
