# Details on Text Extraction

    This chapter provides background on the text extraction methods of PyMuPDF.

    Two questions are of particular interest:

    What do the methods provide?

    What do they imply in terms of processing time and data sizes?

# General structure of a TextPage
    TextPage is one of (Py-) MuPDF’s classes. It is normally created (and destroyed again) behind the scenes when Page text extraction methods are used, but it is also available directly and can be used as a persistent object. Contrary to what its name suggests, a text page may optionally also contain images:

    <page>
        <text block>
            <line>
                <span>
                    <char>
        <image block>
            <img>
    A text page consists of blocks (= roughly paragraphs).

    A block consists of either lines and their characters, or an image.

    A line consists of spans.

    A span consists of adjacent characters with identical font properties: name, size, flags and color.

# Plain Text
    Function TextPage.extractText() (or Page.get_text("text")) extracts a page’s plain text in the original order as specified by the creator of the document.

    An example output:

        print(page.get_text("text"))
        Some text on first page.

    Note

    The output may not match the “natural” reading order you expect. However, you can request a reordering that follows the scheme “top-left to bottom-right” by calling page.get_text("text", sort=True).

# BLOCKS
    Function TextPage.extractBLOCKS() (or Page.get_text("blocks")) extracts a page’s text blocks as a list of items like:

        (x0, y0, x1, y1, "lines in block", block_no, block_type)
    The first four items are the float coordinates of the block’s bbox. The lines within each block are joined by a newline character.

    This is a high-speed method, which by default also extracts image meta information: Each image appears as a block with one text line, which contains meta information. The image itself is not shown.

    As with the plain text output above, the sort argument can be used to obtain a “top-left to bottom-right” reading order.

    Example output:

        print(page.get_text("blocks", sort=False))
        [(50.0, 88.17500305175781, 166.1709747314453, 103.28900146484375,
          'Some text on first page.', 0, 0)]
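
    The tuple layout above lends itself to simple post-processing. The following is a minimal sketch, not PyMuPDF's own code: it works on a hand-written list shaped like the output of page.get_text("blocks"). Only the first text tuple comes from the example above; the other two entries and the helper name `text_blocks_in_reading_order` are invented for illustration. It assumes that block_type is 0 for text blocks and 1 for image blocks, and its sort key (top edge, then left edge) is only an approximation of the “top-left to bottom-right” scheme.

```python
# Sample data shaped like page.get_text("blocks") output.
# Item layout: (x0, y0, x1, y1, "lines in block", block_no, block_type)
# Only the second tuple is taken from the example above; the others are made up.
blocks = [
    (50.0, 200.0, 180.0, 215.0, "A later paragraph.", 2, 0),
    (50.0, 88.17500305175781, 166.1709747314453, 103.28900146484375,
     "Some text on first page.", 0, 0),
    (50.0, 120.0, 150.0, 180.0, "<image: placeholder meta information>", 1, 1),
]


def text_blocks_in_reading_order(blocks):
    """Return the text of all text blocks, ordered top to bottom, then left to right."""
    # Assumption: block_type (index 6) is 0 for text and 1 for images.
    text_only = [b for b in blocks if b[6] == 0]
    # Sort by the top edge (y0) first, then by the left edge (x0).
    text_only.sort(key=lambda b: (b[1], b[0]))
    return [b[4] for b in text_only]


ordered = text_blocks_in_reading_order(blocks)
print(ordered)
```

    With real data, the same helper could be fed the result of page.get_text("blocks") directly, since it only relies on the positions of the tuple items.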

# WORDS
    Function TextPage.extractWORDS() (or Page.get_text("words")) extracts a page’s text words as a list of items like:

        (x0, y0, x1, y1, "word", block_no, line_no, word_no)
    The first four items are the float coordinates of the word’s bbox. The last three integers identify the block, the line within that block, and the word within that line.

    This is a high-speed method. As with the previous methods, argument sort=True will reorder the words.

    Example output:

        for word in page.get_text("words", sort=False):
            print(word)
        (50.0, 88.17500305175781, 78.73200225830078, 103.28900146484375,
        'Some', 0, 0, 0)
        (81.79000091552734, 88.17500305175781, 99.5219955444336, 103.28900146484375,
        'text', 0, 0, 1)
        (102.57999420166016, 88.17500305175781, 114.8119888305664, 103.28900146484375,
        'on', 0, 0, 2)
        (117.86998748779297, 88.17500305175781, 135.5909881591797, 103.28900146484375,
        'first', 0, 0, 3)
        (138.64898681640625, 88.17500305175781, 166.1709747314453, 103.28900146484375,
        'page.', 0, 0, 4)
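
    Because every word carries its block, line and word number, lines can be reassembled from the word list. The following is a minimal sketch under that assumption: the sample tuples are copied from the output above, and the helper name `words_to_lines` is hypothetical, not part of PyMuPDF.

```python
from itertools import groupby

# Sample data copied from the example output above.
# Item layout: (x0, y0, x1, y1, "word", block_no, line_no, word_no)
words = [
    (50.0, 88.17500305175781, 78.73200225830078, 103.28900146484375, "Some", 0, 0, 0),
    (81.79000091552734, 88.17500305175781, 99.5219955444336, 103.28900146484375, "text", 0, 0, 1),
    (102.57999420166016, 88.17500305175781, 114.8119888305664, 103.28900146484375, "on", 0, 0, 2),
    (117.86998748779297, 88.17500305175781, 135.5909881591797, 103.28900146484375, "first", 0, 0, 3),
    (138.64898681640625, 88.17500305175781, 166.1709747314453, 103.28900146484375, "page.", 0, 0, 4),
]


def words_to_lines(words):
    """Join words that share the same (block_no, line_no) into one string per line."""
    # Order by block, line and word number so groupby sees each line as one run.
    ordered = sorted(words, key=lambda w: (w[5], w[6], w[7]))
    return {
        key: " ".join(w[4] for w in group)
        for key, group in groupby(ordered, key=lambda w: (w[5], w[6]))
    }


lines = words_to_lines(words)
print(lines)
```

    Sorting by the three trailing integers first means the helper does not depend on the order in which the words arrive.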

# DICT (or JSON)
    TextPage.extractDICT() (or Page.get_text("dict", sort=False)) output fully reflects the structure of a TextPage and provides image content and position detail (bbox, i.e. boundary boxes in pixel units) for every block, line and span. Images are stored as bytes for DICT output and as base64-encoded strings for JSON output.

    For a visualization of the dictionary structure have a look at Structure of Dictionary Outputs.

    Here is what this looks like:

        {
            "width": 300.0,
            "height": 350.0,
            "blocks": [{
                "type": 0,
                "bbox": (50.0, 88.17500305175781, 166.1709747314453, 103.28900146484375),
                "lines": [{
                    "wmode": 0,
                    "dir": (1.0, 0.0),
                    "bbox": (50.0, 88.17500305175781, 166.1709747314453, 103.28900146484375),
                    "spans": [{
                        "size": 11.0,
                        "flags": 0,
                        "font": "Helvetica",
                        "color": 0,
                        "origin": (50.0, 100.0),
                        "text": "Some text on first page.",
                        "bbox": (50.0, 88.17500305175781, 166.1709747314453, 103.28900146484375)
                    }]
                }]
            }]
        }
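
    The nesting of blocks, lines and spans can be walked with three loops. The following is a minimal sketch that reuses the sample values shown above as a plain Python dictionary, so it runs without a document. The helper name `iter_spans` is hypothetical, and the check on "type" assumes that image blocks (type 1) carry no "lines" key.

```python
# Sample data reusing the values from the DICT example above.
bbox = (50.0, 88.17500305175781, 166.1709747314453, 103.28900146484375)
page_dict = {
    "width": 300.0,
    "height": 350.0,
    "blocks": [{
        "type": 0,
        "bbox": bbox,
        "lines": [{
            "wmode": 0,
            "dir": (1.0, 0.0),
            "bbox": bbox,
            "spans": [{
                "size": 11.0,
                "flags": 0,
                "font": "Helvetica",
                "color": 0,
                "origin": (50.0, 100.0),
                "text": "Some text on first page.",
                "bbox": bbox,
            }],
        }],
    }],
}


def iter_spans(page_dict):
    """Yield (block_index, line_index, span) for every span in text blocks."""
    for block_index, block in enumerate(page_dict["blocks"]):
        # Assumption: only text blocks (type 0) contain "lines".
        if block["type"] != 0:
            continue
        for line_index, line in enumerate(block["lines"]):
            for span in line["spans"]:
                yield block_index, line_index, span


# Collect font, size and text for each span.
summary = [
    (b, l, span["font"], span["size"], span["text"])
    for b, l, span in iter_spans(page_dict)
]
print(summary)
```

    A generator keeps the traversal separate from whatever is done with each span, for example filtering by font size or collecting text per block.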