Extracting words in document using threading #1151
Answered by JorjMcKie
nachonavarro asked this question in Q&A
Hi! Is there any way to parallelize the following code that extracts all the words in a PDF:

    doc = fitz.open("hello.pdf")
    words = []
    for page in doc:
        # Do some other stuff...
        words.extend(page.get_text_words())

I tried using the concurrent futures library:

    from concurrent.futures import ThreadPoolExecutor
    doc = fitz.open("hello.pdf")
    with ThreadPoolExecutor() as executor:
        words = executor.map(lambda page: page.get_text_words(), doc)

but it crashes for some docs so I'm guessing this is not thread-safe? Thanks!
Answered by JorjMcKie, Jul 16, 2021
No. As mentioned in the documentation, PyMuPDF does not support Python threading. Use multiprocessing instead; there are example scripts in the documentation.
Why it crashed for some documents and not others is hard to tell without knowing anything about the environment.

Answer selected by nachonavarro
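For anyone landing here, a minimal sketch of what the multiprocessing route could look like. This is not one of the example scripts from the PyMuPDF documentation, just an illustration of the same idea: each worker process opens its own copy of the file and handles a contiguous page range. The helper names `split_page_ranges`, `extract_words` and `extract_all_words` are made up for this example, and it uses `page.get_text("words")` for the word extraction.

```python
from concurrent.futures import ProcessPoolExecutor


def split_page_ranges(page_count, parts):
    """Split range(page_count) into at most `parts` contiguous (start, stop) pairs."""
    parts = max(1, min(parts, page_count))
    size, extra = divmod(page_count, parts)
    ranges, start = [], 0
    for i in range(parts):
        # The first `extra` ranges get one additional page each.
        stop = start + size + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def extract_words(task):
    """Worker: open the PDF inside this process and read one page range."""
    import fitz  # PyMuPDF, imported here so every process loads it on its own

    path, start, stop = task
    doc = fitz.open(path)  # never pass Document or Page objects between processes
    words = []
    for page_no in range(start, stop):
        words.extend(doc.load_page(page_no).get_text("words"))
    doc.close()
    return words


def extract_all_words(path, workers=4):
    """Hypothetical driver: fan page ranges out to processes, keep page order."""
    import fitz

    doc = fitz.open(path)
    page_count = doc.page_count
    doc.close()

    tasks = [(path, a, b) for a, b in split_page_ranges(page_count, workers)]
    with ProcessPoolExecutor(max_workers=workers) as executor:
        # executor.map yields results in task order, so words stay in page order.
        chunks = executor.map(extract_words, tasks)
        return [word for chunk in chunks for word in chunk]
```

Call it as `words = extract_all_words("hello.pdf")` from inside an `if __name__ == "__main__":` guard, since platforms that spawn worker processes re-import the main module. Whether this is actually faster than the plain loop depends on the document; for small PDFs the process start-up cost will likely dominate.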