# Speedcheck: Show performance differences using separate TextPage objects

In [18]:
import fitz

if tuple(map(int, fitz.VersionBind.split("."))) < (1,19,0):
    raise RuntimeError("Need PyMuPDF v1.19.0 or higher")

doc = fitz.open("1page.pdf")
page = doc[0]

In [19]:
%timeit page.get_text("text")
%timeit page.get_text("words")
%timeit page.get_text("blocks")
%timeit page.get_text("dict")
%timeit page.get_text("rawdict")

2.17 ms ± 50.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.38 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.19 ms ± 84.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.56 ms ± 93.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.09 ms ± 193 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


The following shows that TextPage creation is **_always_** the largest part of the response time of any text extraction.

In [20]:
%timeit page.get_textpage()

2.08 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
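As a rough check of that claim, the mean timings printed above can be turned into shares. This is plain arithmetic on the numbers from this run, not a new measurement; the variable names are made up for illustration, and run-to-run noise is ignored.

```python
# Mean timings in milliseconds, copied from the outputs above.
textpage_ms = 2.08
total_ms = {"text": 2.17, "words": 2.38, "blocks": 2.19, "dict": 2.56, "rawdict": 4.09}

# Fraction of each get_text() call that TextPage creation alone would account for.
share = {fmt: textpage_ms / total for fmt, total in total_ms.items()}

for fmt, s in share.items():
    print(f"{fmt:8s} {s:.0%}")
```

Even for "rawdict", the slowest format here, the share stays just above one half.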


Compare the following text extraction response times, where an existing TextPage is reused, with the corresponding durations above. In this run, execution times drop by roughly 60% ("rawdict") to 97% ("text").

So whenever text extraction performance is a concern, reuse an existing TextPage where you can.
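One way to apply this when several formats are needed from the same page is a small helper like the sketch below. `extract_all` is a hypothetical name, not part of PyMuPDF; it only relies on `page.get_textpage()` and the `textpage=` argument of `page.get_text()` used in this notebook. It assumes the default flags of `get_textpage()` are acceptable for every requested format.

```python
def extract_all(page, formats=("text", "words", "blocks", "dict", "rawdict")):
    """Return {format: extraction result}, building the TextPage only once.

    `page` is expected to behave like a PyMuPDF Page object.
    """
    tp = page.get_textpage()  # the expensive step, done a single time
    # Every extraction below reuses the same TextPage instead of creating its own.
    return {fmt: page.get_text(fmt, textpage=tp) for fmt in formats}
```

With the `page` object from the first cell, `extract_all(page)["words"]` would then be available without paying for TextPage creation again for each format.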

In [21]:
tp = page.get_textpage()
%timeit page.get_text("text", textpage=tp)  # -95%
%timeit page.get_text("words", textpage=tp)  # -90%
%timeit page.get_text("blocks", textpage=tp)  # -95%
%timeit page.get_text("dict", textpage=tp)  # -85%
%timeit page.get_text("rawdict", textpage=tp)  # -60%

64.7 µs ± 2.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
238 µs ± 4.63 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
97.1 µs ± 1.05 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
370 µs ± 3.49 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.56 ms ± 6.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
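The percentages in the cell comments can be recomputed from the two sets of mean timings. The sketch below only does that arithmetic on the numbers printed in this notebook; the dictionary names are illustrative.

```python
# Mean timings in microseconds, copied from the outputs in this notebook.
without_reuse_us = {"text": 2170, "words": 2380, "blocks": 2190, "dict": 2560, "rawdict": 4090}
with_reuse_us = {"text": 64.7, "words": 238, "blocks": 97.1, "dict": 370, "rawdict": 1560}

# Relative reduction in percent when an existing TextPage is reused.
reduction = {
    fmt: 100 * (1 - with_reuse_us[fmt] / without_reuse_us[fmt])
    for fmt in without_reuse_us
}

for fmt, pct in reduction.items():
    print(f"{fmt:8s} -{pct:.1f}%")
```

The results are in line with the rounded values in the comments: the largest saving is for "text" and the smallest for "rawdict".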
