# Speedcheck: Show performance differences using separate TextPage objects
Since v1.19.0, PyMuPDF's text extraction and text search methods accept a new parameter, `textpage`. It allows the same TextPage object to be reused and shared across calls.



## Background
Under the hood, all text searches and extractions in PyMuPDF work in two steps:

1. Step 1 creates a "TextPage" object, which parses the document page and extracts its content based on a number of flags (which control whether e.g. images should also be included in the results). Execution time of this step is **always longer** than that of the second step.
2. Step 2 walks through the **TextPage's content** and generates the requested output (simple text, HTML, search result rectangles, etc). When finished, the textpage is destroyed again.

> MuPDF's main motivation for the "TextPage" object was to abstract from the document type: only the logic for creating a TextPage differs between PDF, XPS, EPUB, etc. The application-side logic that produces the desired result is not affected by the document type.

This notebook shows how to break up text search and extraction into the above two steps. Whenever you need to do multiple searches or extractions on the same document page, you can base them all on the same TextPage and that way gain considerable performance improvements.


## Demonstrations

The following snippets demonstrate the **very significant performance boosts** you get when performing multiple text searches and/or extractions on the same page.

In [8]:
# Uncomment the following to ensure PyMuPDF is installed
#!python -m pip install pymupdf

import fitz  # import PyMuPDF

doc = fitz.open("1page.pdf")  # a PDF in this folder
page = doc[0]  # load first page

Execute in the conventional way, with a new TextPage every time.

For each of the following extraction variants, the response times should be in the range of roughly 1 to 2 **_milliseconds_**.

In [9]:
%timeit page.get_text("text")  # basic "naive" text extraction
%timeit page.get_text("words")  # extract by separate words
%timeit page.get_text("blocks")  # extract by text paragraphs
%timeit page.get_text("dict")  # extract showing span-level details
%timeit page.get_text("rawdict")  # extract showing character details

1.33 ms ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
1.43 ms ± 16 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
1.37 ms ± 9.53 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
1.5 ms ± 31.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
2.18 ms ± 6.03 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


We now time TextPage creation alone. The result shows that it **_always_** accounts for the largest part of every text extraction variant shown before:

In [10]:
%timeit page.get_textpage()

1.22 ms ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
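
To put that statement into numbers, the short calculation below divides the TextPage creation time by each total extraction time. It is plain arithmetic on the timings printed above; those values come from one run on one machine and will differ elsewhere.

```python
# Timings copied from the %timeit outputs above, in milliseconds.
textpage_ms = 1.22
extraction_ms = {
    "text": 1.33,
    "words": 1.43,
    "blocks": 1.37,
    "dict": 1.50,
    "rawdict": 2.18,
}

# Fraction of each extraction that is spent on creating the TextPage.
share = {kind: textpage_ms / total for kind, total in extraction_ms.items()}

for kind, value in share.items():
    print(f"{kind:8s} {value:.0%}")
```

For these measurements the share ranges from about 92% for `"text"` down to about 56% for `"rawdict"`, so TextPage creation is more than half of the total in every case.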


Compare the following text extraction response times, where an **_existing TextPage is reused_**, with the corresponding durations above. You should see execution times drop by roughly 60% to 95%.

So if performance is a concern, do reuse a previously created TextPage!

In [11]:
tp = page.get_textpage()  # prepare a separate textpage, then reuse it:
%timeit page.get_text("text", textpage=tp)  # -95%
%timeit page.get_text("words", textpage=tp)  # -90%
%timeit page.get_text("blocks", textpage=tp)  # -95%
%timeit page.get_text("dict", textpage=tp)  # -85%
%timeit page.get_text("rawdict", textpage=tp)  # -60%

38.7 µs ± 643 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
114 µs ± 1.99 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
47.5 µs ± 570 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
187 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
771 µs ± 23.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


If a text extraction reuses a TextPage, its duration drops significantly, sometimes by more than 90%, into the **_microsecond_** range.
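
The savings can be checked directly from the two sets of timings. The snippet below is only arithmetic on the values printed above (converted to microseconds); the percentages apply to this particular run.

```python
# Timings copied from the two %timeit runs above, in microseconds.
fresh = {"text": 1330, "words": 1430, "blocks": 1370, "dict": 1500, "rawdict": 2180}
reused = {"text": 38.7, "words": 114, "blocks": 47.5, "dict": 187, "rawdict": 771}

# Relative time saved by reusing the TextPage, in percent.
saving = {kind: 100 * (1 - reused[kind] / fresh[kind]) for kind in fresh}

for kind, pct in saving.items():
    print(f"{kind:8s} {pct:.1f}%")
```

The smallest saving is for `"rawdict"`, whose character-level output takes comparatively long to generate even from an existing TextPage.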

## Realistic Use Cases
Separate TextPages make sense if text of the **_same document page_** must be searched or extracted multiple times. 

The following example searches for some text and then, for each hit rectangle, extracts the text it contains (so that spelling differences or the presence of line breaks can be investigated).

In [12]:
def search1():  # do NOT reuse TextPage
    """Search for a word, then extract it from each hit rectangle.
    Do not reuse intermediate TextPages.
    """
    rl = page.search_for("pixmap")
    for r in rl:
        text = page.get_textbox(r)

def search2():  # reuse a TextPage
    """Search for a word,  then extract it from each hit rectangle.
    Reuse a previously created TextPage.
    """
    rl = page.search_for("pixmap", textpage=tp)
    for r in rl:
        text = page.get_textbox(r, textpage=tp)

In [13]:
%timeit search1()

19.3 ms ± 207 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [14]:
%timeit search2()

548 µs ± 8.91 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


Again, reusing a TextPage has saved us more than 95% of the execution time.