# Opening Files
## Supported File Types
PyMuPDF can open files other than just PDF.

The following file types are supported:

| File type | Formats |
|---|---|
| Document formats | PDF, XPS, EPUB, MOBI, FB2, CBZ, SVG, TXT |
| Image formats (input) | JPG/JPEG, PNG, BMP, GIF, TIFF, PNM, PGM, PBM, PPM, PAM, JXR, JPX/JP2, PSD |
| Image formats (output) | JPG/JPEG, PNG, PNM, PGM, PBM, PPM, PAM, PSD, PS |


### How to Open a File
To open a file, do the following:

        doc = pymupdf.open("a.pdf")

### Opening with a Wrong File Extension
If you have a document with a wrong file extension for its type, you can still correctly open it.

Assume that “some.file” is actually an XPS. Open it like so:

        doc = pymupdf.open("some.file", filetype="xps")

### Opening Files as Text
PyMuPDF can open any plain text file as a document. To do this, pass "txt" as the filetype parameter of the pymupdf.open function.

        doc = pymupdf.open("my_program.py", filetype="txt")


In this way you can open a variety of file types and use the typical features that are not PDF-specific, such as text searching, text extraction and page rendering. Once the text content has been rendered, saving it as PDF or merging it with other PDF files is straightforward.

### Examples
#### Opening a C# file
        doc = pymupdf.open("MyClass.cs", filetype="txt")
#### Opening an XML file
        doc = pymupdf.open("my_data.xml", filetype="txt")
#### Opening a JSON file
        doc = pymupdf.open("more_of_my_data.json", filetype="txt")
And so on!

As you can imagine, many text-based file formats can be opened and interpreted by PyMuPDF very simply. This makes data analysis and extraction possible for a wide range of files that were previously out of reach.

### How to Extract Text in Natural Reading Order
One of the common issues with PDF text extraction is that text may not appear in any particular reading order.

This is the responsibility of the PDF creator (software or a human). For example, page headers may have been inserted in a separate step – after the document had been produced. In such a case, the header text will appear at the end of a page text extraction (although it will be correctly shown by PDF viewer software). For example, the following snippet will add some header and footer lines to an existing PDF:

        doc = pymupdf.open("some.pdf")
        header = "Header"  # text in header
        footer = "Page %i of %i"  # text in footer
        for page in doc:
            page.insert_text((50, 50), header)  # insert header
            page.insert_text(  # insert footer 50 points above page bottom
                (50, page.rect.height - 50),
                footer % (page.number + 1, doc.page_count),
            )

The text sequence extracted from a page modified in this way will look like this:

1. original text

2. header line

3. footer line

PyMuPDF has several means to re-establish some reading sequence or even to re-generate a layout close to the original:

1. Use the sort parameter of Page.get_text(). It sorts the output from top-left to bottom-right (ignored for XHTML, HTML and XML output).

2. Use the pymupdf module from the command line: python -m pymupdf gettext ..., which produces a text file where the text has been re-arranged in layout-preserving mode. Many options are available to control the output.

You can also use the above mentioned script with your modifications.
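
To illustrate what "top-left to bottom-right" ordering means, here is a rough sketch that re-orders word tuples of the shape returned by Page.get_text("words"), i.e. (x0, y0, x1, y1, word, ...). It is not the algorithm PyMuPDF uses internally; the function name and the `line_tolerance` parameter are assumptions made for this example.

```python
def words_in_reading_order(words, line_tolerance=3.0):
    """Group word tuples into text lines, top to bottom and left to right.

    words: iterable of tuples (x0, y0, x1, y1, word, ...).
    line_tolerance: words whose bottom coordinates differ by at most this
    value (in points) from the first word of a line are put on that line
    (an assumed heuristic, not PyMuPDF's own logic).
    """
    ordered = sorted(words, key=lambda w: (w[3], w[0]))  # bottom y, then left x
    lines = []
    for w in ordered:
        if lines and abs(w[3] - lines[-1][0][3]) <= line_tolerance:
            lines[-1].append(w)  # same text line as the previous word
        else:
            lines.append([w])  # start a new text line
    return [" ".join(w[4] for w in sorted(line, key=lambda w: w[0])) for line in lines]
```

With `words = page.get_text("words")`, a header that was inserted last but sits at the top of the page would come out first.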

### How to Extract Table Content from Documents
If you see a table in a document, you are normally not looking at something like an embedded Excel or other identifiable object. It usually is just normal, standard text, formatted to appear as tabular data.

Extracting tabular data from such a page area therefore means that you must find a way to identify the table area (i.e. its boundary box), then (1) graphically indicate table and column borders, and (2) extract text based on this information.

This can be a very complex task, depending on details like the presence or absence of lines, rectangles or other supporting vector graphics.

The method Page.find_tables() does all that for you, with high table detection precision. Its great advantage is that it has no external library dependencies and does not need artificial intelligence or machine learning technologies. It also provides an integrated interface to pandas, the well-known Python package for data analysis.

Please have a look at example Jupyter notebooks, which cover standard situations like multiple tables on one page or joining table fragments across multiple pages.

The notebooks mentioned above are available in this GitHub repository:
https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/table-analysis
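
Once a table has been found, its content usually needs some post-processing. The sketch below assumes that the rows were obtained with something like `page.find_tables().tables[0].extract()`, and that this call yields a list of rows, each a list of cell strings (or None for empty cells), with the first row holding the column names. The helper name `rows_to_records` is hypothetical.

```python
def rows_to_records(rows):
    """Turn table rows (first row = header) into a list of dictionaries.

    rows: list of rows, each a list of cell values (str or None).
    Header cells that are empty get a generated name like "col2".
    """
    if not rows:
        return []
    header, *body = rows
    names = [(h or "col%i" % i).strip() for i, h in enumerate(header)]
    return [dict(zip(names, row)) for row in body]
```

Records in this shape are easy to filter or dump as JSON when a full pandas DataFrame is not needed.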

### How to Mark Extracted Text
There is a standard function to search for arbitrary text on a page: Page.search_for(). It returns a list of Rect objects, each surrounding a found occurrence. These rectangles can, for example, be used to automatically insert annotations that visibly mark the found text.

This method has advantages and drawbacks. Pros are:

- The search string can contain blanks and wrap across lines.

- Upper and lower case characters are treated as equal.

- Word hyphenation at line ends is detected and resolved.

- The return value may also be a list of Quad objects to precisely locate text that is not parallel to either axis. Using Quad output is also recommended when page rotation is not zero.

But you also have other options:


        import sys
        import pymupdf

        def mark_word(page, text):
            """Underline each word that contains 'text'.
            """
            found = 0
            wlist = page.get_text("words", delimiters=None)  # make the word list
            for w in wlist:  # scan through all words on page
                if text in w[4]:  # w[4] is the word's string
                    found += 1  # count
                    r = pymupdf.Rect(w[:4])  # make rect from word bbox
                    page.add_underline_annot(r)  # underline
            return found

        fname = sys.argv[1]  # filename
        text = sys.argv[2]  # search string
        doc = pymupdf.open(fname)

        print("underlining words containing '%s' in document '%s'" % (text, doc.name))

        new_doc = False  # indicator if anything found at all

        for page in doc:  # scan through the pages
            found = mark_word(page, text)  # mark the page's words
            if found:  # if anything found ...
                new_doc = True
                print("found '%s' %i times on page %i" % (text, found, page.number + 1))

        if new_doc:
            doc.save("marked-" + doc.name)



This script uses Page.get_text("words") to look for a string that is passed in as a command-line parameter. This method separates a page’s text into “words” using white spaces as delimiters. Further remarks:

- If found, the complete word containing the string is marked (underlined) – not only the search string.

- The search string must not contain word delimiters. By default, word delimiters are white spaces and the non-breaking space chr(0xA0). If you use extra delimiting characters like page.get_text("words", delimiters="./,") then none of these characters should be included in your search string either.

- As shown here, upper / lower case is respected. But this can be changed by using the string method lower() (or even regular expressions) in function mark_word.

- There is no upper limit: all occurrences will be detected.

- You can use anything to mark the word: ‘Underline’, ‘Highlight’, ‘StrikeThrough’ or ‘Square’ annotations, etc.

Here is an example snippet of a page of this manual, where “MuPDF” has been used as the search string. Note that all strings containing “MuPDF” have been completely underlined (not just the search string).
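
The remark about ignoring case via regular expressions could look like the sketch below. It only selects the matching word boxes from a word list of the shape returned by Page.get_text("words"); the function name `find_word_boxes` and its `ignore_case` parameter are assumptions for this example.

```python
import re


def find_word_boxes(words, text, ignore_case=True):
    """Return the bboxes of all words that contain 'text'.

    words: iterable of tuples (x0, y0, x1, y1, word, ...).
    The search string is escaped, so it is matched literally.
    """
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.compile(re.escape(text), flags)
    # w[:4] is the word's bbox, w[4] is the word's string
    return [tuple(w[:4]) for w in words if pattern.search(w[4])]
```

Inside mark_word, each returned tuple can be turned into a pymupdf.Rect and passed to an annotation method such as page.add_underline_annot().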



### How to Mark Searched Text
This script searches for text and marks it:

        # -*- coding: utf-8 -*-
        import pymupdf

        # the document to annotate
        doc = pymupdf.open("tilted-text.pdf")

        # the text to be marked
        needle = "¡La práctica hace el campeón!"

        # work with first page only
        page = doc[0]

        # get list of text locations
        # we use "quads", not rectangles because text may be tilted!
        rl = page.search_for(needle, quads=True)

        # mark all found quads with one annotation
        page.add_squiggly_annot(rl)

        # save to a new PDF
        doc.save("a-squiggly.pdf")



The same approach, applied to another document and printing the quads that were found:

        import pymupdf

        doc = pymupdf.open('example.pdf')

        page = doc[0]

        # search for the text; quads are returned instead of rectangles
        r1 = page.search_for('confidential', quads=True)
        print(r1)

        # mark all found quads with one annotation
        page.add_squiggly_annot(r1)

        doc.save('squiggly-annot-example.pdf')
        print('squiggly-annot-example.pdf is saved')

Output:

        [Quad(Point(179.90599060058594, 61.570030212402344), Point(240.6019744873047, 61.570030212402344), Point(179.90599060058594, 78.05802917480469), Point(240.6019744873047, 78.05802917480469)), Quad(Point(31.190000534057617, 89.92000579833984), Point(94.55000305175781, 89.92000579833984), Point(31.190000534057617, 106.40800476074219), Point(94.55000305175781, 106.40800476074219))]
        squiggly-annot-example.pdf is saved


### How to Mark Non-horizontal Text
The previous section already shows an example of marking non-horizontal text that was detected by text searching.

But text extraction with the “dict” / “rawdict” options of Page.get_text() may also return text with a non-zero angle to the x-axis. This is indicated by the value of the line dictionary’s "dir" key: it is the tuple (cosine, sine) for that angle. If line["dir"] != (1, 0), then the text of all its spans is rotated by (the same) angle != 0.

The “bboxes” returned by the method however are rectangles only – not quads. So, to mark span text correctly, its quad must be recovered from the data contained in the line and span dictionary. Do this with the following utility function (new in v1.18.9):

        span_quad = pymupdf.recover_quad(line["dir"], span)
        annot = page.add_highlight_annot(span_quad)  # this will mark the complete span text

If you want to mark the complete line or a subset of its spans in one go, use the following snippet (works for v1.18.10 or later):

        line_quad = pymupdf.recover_line_quad(line, spans=line["spans"][1:-1])
        page.add_highlight_annot(line_quad)


The spans argument above may specify any sub-list of line["spans"]. In the example above, the second through the second-to-last spans are marked. If omitted, the complete line is taken.
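
Since line["dir"] is described above as the tuple (cosine, sine), the rotation angle can be recovered with basic trigonometry. This is a small sketch with hypothetical helper names. Whether a positive angle appears clockwise or counter-clockwise depends on the page coordinate system and is not asserted here.

```python
import math


def line_angle(direction):
    """Return the angle in degrees encoded by a (cosine, sine) tuple."""
    cosine, sine = direction
    return math.degrees(math.atan2(sine, cosine))


def is_rotated(direction, tolerance=1e-6):
    """Return True if the direction differs from (1, 0) beyond the tolerance."""
    cosine, sine = direction
    return abs(cosine - 1.0) > tolerance or abs(sine) > tolerance
```

A typical use is `if is_rotated(line["dir"]): ...` to decide whether quads must be recovered before marking a span.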



### How to Analyze Font Characteristics
To analyze the characteristics of text in a PDF, use this elementary script as a starting point:

        import sys

        import pymupdf


        def flags_decomposer(flags):
            """Make font flags human readable."""
            l = []
            if flags & 2 ** 0:
                l.append("superscript")
            if flags & 2 ** 1:
                l.append("italic")
            if flags & 2 ** 2:
                l.append("serifed")
            else:
                l.append("sans")
            if flags & 2 ** 3:
                l.append("monospaced")
            else:
                l.append("proportional")
            if flags & 2 ** 4:
                l.append("bold")
            return ", ".join(l)


        doc = pymupdf.open(sys.argv[1])
        page = doc[0]

        # read page text as a dictionary, suppressing extra spaces in CJK fonts
        blocks = page.get_text("dict", flags=11)["blocks"]
        for b in blocks:  # iterate through the text blocks
            for l in b["lines"]:  # iterate through the text lines
                for s in l["spans"]:  # iterate through the text spans
                    print("")
                    font_properties = "Font: '%s' (%s), size %g, color #%06x" % (
                        s["font"],  # font name
                        flags_decomposer(s["flags"]),  # readable font flags
                        s["size"],  # font size
                        s["color"],  # font color
                    )
                    print("Text: '%s'" % s["text"])  # simple print of text
                    print(font_properties)
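
The script prints s["color"] with the format #%06x, which suggests an integer that packs the color as 0xRRGGBB. Under that assumption, the individual components can be unpacked as sketched below (the helper name is hypothetical):

```python
def srgb_to_rgb(color):
    """Split an integer of the form 0xRRGGBB into a (red, green, blue) tuple.

    Each component is an integer in the range 0..255.
    """
    red = (color >> 16) & 0xFF
    green = (color >> 8) & 0xFF
    blue = color & 0xFF
    return red, green, blue
```

This is handy when spans should be grouped or filtered by color, for example to find all text that is not black.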

### How to Insert Text
PyMuPDF provides ways to insert text on new or existing PDF pages with the following features:

- choose the font, including built-in fonts and fonts that are available as files

- choose text characteristics like bold, italic, font size, font color, etc.

- position the text in multiple ways:

    - either as simple line-oriented output starting at a certain point,

    - or fitting text in a box provided as a rectangle, in which case text alignment choices are also available,

- choose whether text should be put in foreground (overlay existing content),

- all text can be arbitrarily “morphed”, i.e. its appearance can be changed via a Matrix, to achieve effects like scaling, shearing or mirroring,

- independently from morphing and in addition to that, text can be rotated by integer multiples of 90 degrees.

All of the above is provided by three basic Page (or, respectively, Shape) methods:

- Page.insert_font() – install a font for the page for later reference. The result is reflected in the output of Document.get_page_fonts(). The font can be:

    - provided as a file,

    - via Font (then use Font.buffer)

    - already present somewhere in this or another PDF, or

    - be a built-in font.

- Page.insert_text() – write some lines of text. Internally, this uses Shape.insert_text().

- Page.insert_textbox() – fit text in a given rectangle. Here you can choose text alignment features (left, right, centered, justified) and you keep control as to whether text actually fits. Internally, this uses Shape.insert_textbox().

The following example writes the same kind of text with four different rotations:

        import pymupdf

        doc = pymupdf.open()  # new or existing PDF
        page = doc.new_page()  # new page, or choose doc[n]

        # write in this overall area
        rect = pymupdf.Rect(100, 100, 300, 150)

        # partition the area in 4 equal sub-rectangles
        CELLS = pymupdf.make_table(rect, cols=4, rows=1)

        t1 = "text with rotate = 0."  # the texts we will write
        t2 = "text with rotate = 90."
        t3 = "text with rotate = 180."
        t4 = "text with rotate = 270."
        text = [t1, t2, t3, t4]
        red = pymupdf.pdfcolor["red"]  # some colors
        gold = pymupdf.pdfcolor["gold"]
        blue = pymupdf.pdfcolor["blue"]
        """
        We use a Shape object (something like a canvas) to output the text and
        the rectangles surrounding it for demonstration.
        """
        shape = page.new_shape()  # create Shape
        for i in range(len(CELLS[0])):
            shape.draw_rect(CELLS[0][i])  # draw rectangle
            shape.insert_textbox(
                CELLS[0][i], text[i], fontname="hebo", color=blue, rotate=90 * i
            )

        shape.finish(width=0.3, color=red, fill=gold)

        shape.commit()  # write all stuff to the page
        # __file__ is not defined in a notebook, so an explicit (arbitrary) name is used
        doc.ez_save("textbox-rotations.pdf")
