# Speedcheck: Show performance differences using separate TextPage objects
In v1.19.0, PyMuPDF added a new parameter `textpage` to its text extraction and text search methods. This allows reuse and sharing of the same TextPage object.

## Background
Under the hood, all text searches and extractions in PyMuPDF work in two steps:

1. Step 1 creates a "TextPage" object, which parses the document page and extracts its content based on a number of flags (which control, for example, whether images are also included in the results). Execution time of this step is **always longer** than that of the second step.
2. Step 2 walks through the **TextPage's content** and generates the requested output (simple text, HTML, search result rectangles, etc.). When finished, the TextPage is destroyed again.
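
The two steps can be made explicit in code. The following is a minimal sketch, not PyMuPDF's internal implementation: the function name `extract_in_two_steps` is hypothetical, and `page` is assumed to be a PyMuPDF `Page` object. It only uses `get_textpage()` and the `textpage` parameter described in this notebook.

```python
def extract_in_two_steps(page):
    """Extract plain text and words from `page`, making both steps explicit.

    `page` is assumed to be a PyMuPDF Page object (hypothetical helper).
    """
    # Step 1 (expensive): parse the page into a TextPage.
    textpage = page.get_textpage()

    # Step 2 (cheap): generate the requested outputs from the TextPage.
    text = page.get_text("text", textpage=textpage)
    words = page.get_text("words", textpage=textpage)
    return text, words
```

Because the TextPage is created explicitly here, it is not destroyed after the first extraction and can serve both outputs.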

> MuPDF's major motivation for introducing the object type "TextPage" was to abstract from the file type of a document: only the logic for creating a TextPage differs between a PDF, XPS, HTML or other document; the application-side logic that creates the desired result is not affected by the document type.

To provide an intuitively simple, easy-to-use API, PyMuPDF does not bother the programmer with these details: every text search or text extraction looks the same for all document types and hence creates and deletes a TextPage every time.

However, Optical Character Recognition also happens inside TextPage creation. Because this may entail a significant execution time, it makes sense to offer a way to reuse a TextPage and avoid multiple, potentially expensive creations.

While the main motivation was OCR, reusing a TextPage object is now **always** possible. It is not bound to OCR in any way.

## Demonstrations

The following snippets demonstrate the very significant performance gains in cases where you perform multiple text searches and/or extractions on the same page.

In [1]:
import fitz

if tuple(map(int, fitz.VersionBind.split("."))) < (1,19,0):
    raise RuntimeError("Need PyMuPDF v1.19.0 or higher")

doc = fitz.open("1page.pdf")
page = doc[0]

First execute with a new TextPage created every time:

In [2]:
%timeit page.get_text("text")
%timeit page.get_text("words")
%timeit page.get_text("blocks")
%timeit page.get_text("dict")
%timeit page.get_text("rawdict")

2.44 ms ± 106 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.83 ms ± 269 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.3 ms ± 75 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.64 ms ± 40.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.2 ms ± 53.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
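
`%timeit` is an IPython magic and is not available in a plain Python script. As a rough equivalent, the standard library's `timeit` module can produce comparable mean and standard deviation figures. The helper name `time_call` is hypothetical, and the defaults simply mirror the "7 runs, 100 loops each" shown above.

```python
import statistics
import timeit


def time_call(func, number=100, repeat=7):
    """Return (mean, standard deviation) of the per-call time in milliseconds.

    `func` is called `number` times in each of `repeat` runs.
    """
    totals = timeit.repeat(func, number=number, repeat=repeat)
    per_call_ms = [1000.0 * total / number for total in totals]
    return statistics.mean(per_call_ms), statistics.stdev(per_call_ms)
```

With the `page` object from the first cell, `time_call(lambda: page.get_text("text"))` would measure the same call as the first `%timeit` line. Absolute numbers will differ between machines and documents.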


The following shows that TextPage creation is **_always_** the longest part of any text extraction's response time.

In [3]:
%timeit page.get_textpage()

2.12 ms ± 81.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Compare the following text extraction response times, where an **_existing TextPage is reused_**, with the corresponding durations above. You should see execution times reduced by roughly 60% to over 95%.

So if performance is a concern, reuse a predefined TextPage.

In [4]:
tp = page.get_textpage()
%timeit page.get_text("text", textpage=tp)  # -95%
%timeit page.get_text("words", textpage=tp)  # -90%
%timeit page.get_text("blocks", textpage=tp)  # -95%
%timeit page.get_text("dict", textpage=tp)  # -85%
%timeit page.get_text("rawdict", textpage=tp)  # -60%

81.2 µs ± 3.49 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
252 µs ± 6.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
105 µs ± 1.05 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
410 µs ± 16.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.75 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
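
To check the percentages, the reductions can be computed directly from the mean timings of the two runs above. This is plain arithmetic on the figures printed in this notebook; the names `timings_ms` and `reduction_percent` are illustrative only.

```python
# Mean timings in milliseconds, copied from the outputs above:
# (new TextPage on every call, reused TextPage)
timings_ms = {
    "text": (2.44, 0.0812),
    "words": (2.83, 0.252),
    "blocks": (2.3, 0.105),
    "dict": (2.64, 0.410),
    "rawdict": (4.2, 1.75),
}


def reduction_percent(before, after):
    """Return the relative saving of `after` versus `before` in percent."""
    return 100.0 * (before - after) / before


reductions = {
    fmt: reduction_percent(before, after)
    for fmt, (before, after) in timings_ms.items()
}

for fmt, pct in reductions.items():
    print(f"{fmt:8s} {pct:5.1f}% reduction")
```

For these measurements the savings come out at about 96.7% ("text"), 91.1% ("words"), 95.4% ("blocks"), 84.5% ("dict") and 58.3% ("rawdict").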


## Realistic Use Cases
The following is a typical example of code that creates the same TextPage multiple times and therefore benefits a lot from reusing one instead.

We search for some word on a page and then validate each occurrence (e.g. spelling or upper / lower case differences).

In [5]:
def search1():
    """Search for a word, then check each hit rectangle.
    Do not reuse intermediate TextPages."""
    rl = page.search_for("pixmap")
    for r in rl:
        text = page.get_textbox(r)

def search2():
    """Search for a word, then check each hit rectangle.
    Reuse a previously created TextPage."""
    rl = page.search_for("pixmap", textpage=tp)
    for r in rl:
        text = page.get_textbox(r, textpage=tp)

In [6]:
%timeit search1()

29.4 ms ± 803 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [7]:
%timeit search2()

1.05 ms ± 74.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Again, reusing an existing TextPage has saved us more than 95% of the execution time.
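
The same pattern extends to whole documents by creating one TextPage per page and sharing it between the search and every validation. The following is a sketch under stated assumptions: `find_exact_case` is a hypothetical helper, `doc` is assumed to be an open PyMuPDF `Document`, and the validation shown (an exact-case check) is just one possible choice.

```python
def find_exact_case(doc, needle):
    """Return (page number, hit rectangle) pairs whose text contains
    `needle` with exactly the given upper / lower case.

    `doc` is assumed to be an open PyMuPDF Document (hypothetical helper).
    """
    hits = []
    for page in doc:
        # One TextPage per page, shared by the search and all validations.
        tp = page.get_textpage()
        for rect in page.search_for(needle, textpage=tp):
            # search_for ignores case, so compare the extracted text exactly.
            if needle in page.get_textbox(rect, textpage=tp):
                hits.append((page.number, rect))
    return hits
```

A TextPage belongs to the page it was created from, so it is created inside the loop rather than once per document.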