# Working with PDFs and Other File Types Such as Docs and Images

### Installing the PyMuPDF module

        pip install PyMuPDF

### Importing the module

In [1]:
import pymupdf

### Opening the File

In [18]:
import pymupdf

pdf = pymupdf.open('example.pdf')



## Some Document Methods

### Method / Attribute 

    Document.page_count     the number of pages (int)
    Document.metadata       the metadata (dict)
    Document.get_toc()      get the table of contents (list)
    Document.load_page()    load a Page

In [19]:
import pymupdf

pdf = pymupdf.open('example.pdf')

print(pdf.metadata)

{'format': 'PDF 1.3', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': 'PyFPDF 1.7.2 http://pyfpdf.googlecode.com/', 'creationDate': 'D:20240612210545', 'modDate': '', 'trapped': '', 'encryption': None}
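The creationDate value in the metadata above is a PDF date string. As a minimal sketch, it could be turned into a datetime like this. The helper name parse_pdf_date is hypothetical (it is not part of PyMuPDF), and the sketch only handles the D:YYYYMMDDHHmmSS prefix, ignoring any timezone suffix.

```python
from datetime import datetime


def parse_pdf_date(value):
    """Parse a PDF date string such as 'D:20240612210545'.

    Hypothetical helper: only the 'D:YYYYMMDDHHmmSS' part is read.
    Returns None for empty or malformed values.
    """
    if not value or not value.startswith("D:"):
        return None
    digits = value[2:16]  # the 14 digits after the 'D:' prefix
    try:
        return datetime.strptime(digits, "%Y%m%d%H%M%S")
    except ValueError:
        return None


print(parse_pdf_date("D:20240612210545"))  # 2024-06-12 21:05:45
print(parse_pdf_date(""))                  # None (e.g. an empty modDate)
```

With an opened document this would be called as parse_pdf_date(pdf.metadata['creationDate']).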


In [20]:
import pymupdf

pdf = pymupdf.open('example.pdf')

print(pdf.page_count)

1


In [22]:
import pymupdf

pdf = pymupdf.open('example.pdf')

# getting the table of contents
print(pdf.get_toc())

[]
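The example file has no table of contents, so the list is empty. For a document that has one, each entry is a list of the form [level, title, page], with a 1-based page number. The following sketch prints such a list as an indented outline; format_toc and the sample entries are made up for illustration.

```python
def format_toc(toc):
    """Render [level, title, page] entries as an indented outline."""
    lines = []
    for entry in toc:
        level, title, page = entry[:3]  # ignore any extra items
        indent = "  " * (level - 1)     # two spaces per nesting level
        lines.append(f"{indent}{title} (page {page})")
    return "\n".join(lines)


# Made-up sample data in the shape returned by Document.get_toc()
sample_toc = [[1, "Introduction", 1], [2, "Scope", 2], [1, "Results", 5]]
print(format_toc(sample_toc))
```

With an opened document: print(format_toc(pdf.get_toc())).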


In [25]:
import pymupdf

pdf = pymupdf.open('example.pdf')

# loading the page
print(pdf.load_page(0))

page 0 of example.pdf


## Ways to load a page from the document

    Page numbers are zero-based, like Python list indices

In [26]:
import pymupdf

pdf = pymupdf.open('example.pdf')

p1 = pdf.load_page(0)

p2 = pdf[0]

print(p1)
print(p2)

page 0 of example.pdf
page 0 of example.pdf


    for page in pdf:
        pass  # do something with 'page'

    # ... or read backwards
    for page in reversed(pdf):
        pass  # do something with 'page'

    # ... or even use 'slicing'
    for page in pdf.pages(start, stop, step):
        pass  # do something with 'page'

## Extracting Text from a PDF

In [4]:
import pymupdf

pdf = pymupdf.open('example.pdf')  # open a document
out = open('output.txt', 'wb')  # create a binary output file

for page in pdf:    # iterate over the document pages
    text = page.get_text().encode('utf8')   # get plain text, encoded as UTF-8
    out.write(text)     # write the text of the page
    out.write(bytes((12,)))     # write page delimiter (form feed, 0x0C)
out.close()

In [5]:
chr(12)

'\x0c'

A simpler way to extract the text of every page of a PDF

In [8]:
import pymupdf

pdf = pymupdf.open('Tamilnadu Trip.pdf')

text = f"{chr(12)}\n\n".join([page.get_text() for page in pdf])

with open('Tamilnadu Trip text.txt', 'w') as f:
    f.write(text)

print('completed')

completed


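This method does not work: page.get_text('blocks') returns a list of tuples for each page, not a string, so str.join() raises a TypeError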

In [10]:
import pymupdf

pdf = pymupdf.open('Tamilnadu Trip.pdf')

text = f"{chr(12)}\n\n".join([page.get_text('blocks') for page in pdf])

with open('Tamilnadu Trip text blocks.txt', 'w') as f:
    f.write(text)

print('completed')

TypeError: sequence item 0: expected str instance, tuple found
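One possible fix, sketched here with made-up sample data instead of a real file: each tuple returned by get_text('blocks') has the form (x0, y0, x1, y1, text, block_no, block_type), so the text at index 4 can be joined after skipping image blocks (block_type 1). The helper name blocks_to_text is hypothetical.

```python
FORM_FEED = chr(12)  # page delimiter, as in the cells above


def blocks_to_text(blocks):
    """Join the text of all text blocks (block_type == 0) of one page."""
    return "\n".join(b[4].rstrip("\n") for b in blocks if b[6] == 0)


# Made-up blocks for two pages, in the shape of page.get_text('blocks')
sample_pages = [
    [
        (72, 72, 300, 90, "First paragraph\n", 0, 0),
        (72, 100, 300, 200, "<image placeholder>", 1, 1),  # image block
    ],
    [(72, 72, 300, 90, "Second page\n", 0, 0)],
]

text = f"{FORM_FEED}\n\n".join(blocks_to_text(blocks) for blocks in sample_pages)
print(repr(text))
```

With a real document the list comprehension would become [blocks_to_text(page.get_text('blocks')) for page in pdf].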

## Extracting Key-Value Pairs

In [16]:
"""
Utility
--------
This demo script shows how to extract key-value pairs from a page with a
"predictable" layout, as can be found in invoices and other formalized
documents.

In such cases, a text extraction based on "words" leads to results that
are both simple and fast, and avoids using regular expressions.

The example analyzes an invoice and extracts the date, invoice number, and
various amounts.

Because of the sort, correct values for each keyword will be found if the
value's boundary box bottom is not higher than that of the keyword.
So it could just as well be on the next line. The only condition is, that
no other text exists in between.

Please note that the code works unchanged also for other supported document
types, such as XPS or EPUB, etc.
"""

import pymupdf  # "fitz" is the legacy import name of the same package

doc = pymupdf.open("invoice-simple.pdf")  # example document
page = doc[0]  # first page
words = page.get_text("words", sort=True)  # extract sorted words

for i, word in enumerate(words):
    # information items will be found prefixed with their "key"
    text = word[4]
    if text == "DATE:":  # the following word will be the date!
        date = words[i + 1][4]
        print("Invoice date:", date)
    elif text == "Subtotal":
        subtotal = words[i + 1][4]
        print("Subtotal:", subtotal)
    elif text == "Tax":
        tax = words[i + 1][4]
        print("Tax:", tax)
    elif text == "Price":
        price = words[i + 9][4]
        print("Price:", price)
    elif text == "INVOICE":
        inv_number = words[i + 2][4]  # skip the "#" sign
        print("Invoice number:", inv_number)
    elif text == "BALANCE":
        balance = words[i + 2][4]  # skip the word "DUE"
        print("Balance due:", balance)

Invoice date: 05/25/2023
Invoice number: 2023-512
Price: $1024.00
Subtotal: $1024.00
Tax: $0.00
Balance due: $1024.00
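The chain of if/elif branches above can be condensed into one lookup function. This is only a sketch: value_after is a hypothetical helper, and the word tuples below are made-up stand-ins for page.get_text("words", sort=True).

```python
def value_after(words, key, offset=1):
    """Return the text of the word `offset` positions after the first `key`.

    `words` are tuples (x0, y0, x1, y1, text, block_no, line_no, word_no).
    Returns None if the key is missing or the value would be out of range.
    """
    for i, word in enumerate(words):
        if word[4] == key and i + offset < len(words):
            return words[i + offset][4]
    return None


# Made-up sample words in reading order
sample_words = [
    (0, 0, 10, 10, "DATE:", 0, 0, 0),
    (12, 0, 40, 10, "05/25/2023", 0, 0, 1),
    (0, 20, 30, 30, "BALANCE", 1, 0, 0),
    (32, 20, 50, 30, "DUE", 1, 0, 1),
    (52, 20, 90, 30, "$1024.00", 1, 0, 2),
]

print("Invoice date:", value_after(sample_words, "DATE:"))
print("Balance due:", value_after(sample_words, "BALANCE", offset=2))  # skip "DUE"
```

The offsets still depend on the layout of the particular invoice, exactly as in the script above.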


# How to Extract Text from within a Rectangle

Extracting text from within specific rectangular areas of a document page is frequently required.

In PyMuPDF, you can select from several options to achieve this. All methods are applicable to all document types supported by MuPDF, not only PDF. Choose the right method from the following list:

1. Page.get_text("words")
This is an old, standard extraction method. The method delivers a list of tuples, each of which represents one string without spaces (called a "word") together with its position. Each tuple looks like this: (x0, y0, x1, y1, "string", blocknumber, linenumber, wordnumber). The first 4 items are the coordinates of the bbox that surrounds "string". The last 3 items are the block number on the page, the line number in the block, and the word number in the line.

You have to write a script which selects the words contained in (or intersecting) the given rectangle by using the bbox coordinates, then sort the result, and then glue words together again that belong to the same line.

This approach can cope with documents where text is not stored in desired reading sequence: you will probably sort the word list by vertical and then by horizontal coordinates. You may also find a way to put words in the same line even if their vertical coordinates differ by some small threshold.

The script textbox-extract-1.py is an example of such a script. It also implements two word selection alternatives: one accepting only fully contained words, and a second one that also includes intersecting words.

2. Page.get_textbox(rect)
Returns text contained in the rectangle 'rect'. Text appears in the sequence as coded in the document. So it may not be in a desirable reading sequence. Inclusion of text is decided by character and words may hence appear mutilated. Line breaks may be present, but one final line break will be omitted. See the example script textbox-extract-2.py in this folder.

3. Page.get_text("text", clip=rect)
This is one of the old, standard extraction methods. The clip parameter is new and was introduced in version 1.17.7. If clip is not None, the result looks like the previous method's output, except that there always is a final line break.
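The first approach in the list above (select words by bbox, sort them, glue them back into lines) could be sketched roughly as follows. This is not the code of textbox-extract-1.py; the function names and the sample word tuples are made up, and lines are grouped by the rounded bottom coordinate y1.

```python
def words_in_rect(words, rect, strict=True):
    """Select word tuples inside `rect` (any 4-sequence x0, y0, x1, y1).

    strict=True keeps only fully contained words,
    strict=False also keeps words whose bbox intersects the rectangle.
    """
    rx0, ry0, rx1, ry1 = rect
    selected = []
    for w in words:
        x0, y0, x1, y1 = w[:4]
        if strict:
            keep = rx0 <= x0 and ry0 <= y0 and x1 <= rx1 and y1 <= ry1
        else:
            keep = x0 < rx1 and rx0 < x1 and y0 < ry1 and ry0 < y1
        if keep:
            selected.append(w)
    return selected


def words_to_lines(words):
    """Group words by rounded bottom coordinate, then sort left to right."""
    lines = {}
    for w in words:
        lines.setdefault(round(w[3]), []).append(w)
    return [
        " ".join(w[4] for w in sorted(group, key=lambda w: w[0]))
        for _, group in sorted(lines.items())
    ]


# Made-up words in the shape of page.get_text("words")
sample_words = [
    (10, 10, 40, 20, "Wer", 0, 0, 0),
    (45, 10, 70, 20, "eine", 0, 0, 1),
    (75, 10.3, 130, 20.2, "perfekte", 0, 0, 2),
    (135, 10, 200, 20, "Seifenblase", 0, 0, 3),
    (10, 25, 60, 35, "schaffen", 0, 1, 0),
]
rect = (0, 0, 140, 40)

print(words_to_lines(words_in_rect(sample_words, rect, strict=True)))
print(words_to_lines(words_in_rect(sample_words, rect, strict=False)))
```

Rounding y1 is a simple way to tolerate small vertical differences within one line; a real script may need a larger threshold.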

### Notes
This folder contains an example file search.pdf with one page and an annotation which shows the area to select from. The TOFU symbol in some of the outputs further down represents the big black triangle whose character bbox intersects the selection rectangle.

[Screenshot of the search.pdf page with the selection rectangle]

### Output of textbox-extract-1.py
This script is based on Page.get_text("words"). Words are selected in two ways: (1) whether they are fully contained in the given rectangle, or (2) whether their bbox has a non-empty intersection with it. Look at the above picture to compare these effects. The bottom vertical coordinates y1 of the words are rounded to cope with any artifacts that may be caused by e.g. font changes or similar things.

Select the words strictly contained in rectangle
------------------------------------------------
Wer eine perfekte
schaffen will, braucht
und Seife.
das schon länger und
diesbezüglich mit
aus. Unter
sie auf den
Guaran (E 412).

Select the words intersecting the rectangle
-------------------------------------------
Wer eine perfekte Seifenblase
schaffen will, braucht mehr

Wasser und Seife. Enthusiasten
sen das schon länger und tauschen
sich diesbezüglich mit Hilfe
Online-Wikis aus. Unter anderem
schwören sie auf den Lebensmittel-
zusatzstoff Guaran (E 412).

### Output of textbox-extract-2.py
This is based on Page.get_textbox(rect). The selection is based on single characters: a character is included if its bbox intersects rect. Apart from this, text is selected as present in the document, including any spaces and line breaks; no reordering takes place.

This obviously is a lot simpler and may be sufficient if you have no issue with the reading sequence and properly positioning the selection rectangle.

It would also be the typical way to verify that the text found by some previous Page.search_for() really is what you have been looking for.


Wer eine perfekte Seife
schaffen will, braucht m
asser und Seife. Enthusia
n das schon länger und t
ch diesbezüglich mit Hilfe
nline-Wikis aus. Unter an
hwören sie auf den Lebe
satzstoff Guaran (E 412).
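The verification idea mentioned above (checking what Page.search_for() actually found) might look like the sketch below. The function name texts_at_hits is hypothetical; page is assumed to be a PyMuPDF Page object.

```python
def texts_at_hits(page, needle):
    """Return (rect, text) pairs for every occurrence of `needle` on `page`.

    `page` is assumed to offer search_for() and get_textbox(),
    as a PyMuPDF Page does.
    """
    return [(rect, page.get_textbox(rect)) for rect in page.search_for(needle)]
```

Called as texts_at_hits(pdf[0], "Seife"), it would list each hit rectangle together with the characters PyMuPDF finds inside it.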

# Inspecting the Links, Annotations or Form Fields of a Page

### Get all links on a page

    links = page.get_links()  # list of all links on the page

    for link in page.links():  # ... or iterate over them
        pass  # do something with 'link'

If dealing with a PDF document page, there may also exist annotations (Annot) or form fields (Widget), each of which has its own iterator:

    for annot in page.annots():
        pass  # do something with 'annot'

    for field in page.widgets():
        pass  # do something with 'field'

# Rendering a Page
This example creates a raster image of a page’s content:

    pix = page.get_pixmap()

pix is a Pixmap object which (in this case) contains an RGB image of the page, ready to be used for many purposes. Method Page.get_pixmap() offers lots of variations for controlling the image: resolution / DPI, colorspace (e.g. to produce a grayscale image or an image with a subtractive color scheme), transparency, rotation, mirroring, shifting, shearing, etc. For example: to create an RGBA image (i.e. containing an alpha channel), specify pix = page.get_pixmap(alpha=True).
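As a small illustration of the resolution option: PDF page coordinates are measured in points (1/72 inch), so a target DPI translates into a zoom factor relative to 72. The helper zoom_for_dpi below is hypothetical and only does this arithmetic.

```python
def zoom_for_dpi(dpi, base_dpi=72):
    """Return the scale factor that renders a page at `dpi`.

    A factor of 1 corresponds to 72 DPI, i.e. one pixel per point.
    """
    return dpi / base_dpi


print(zoom_for_dpi(144))  # 2.0
print(zoom_for_dpi(72))   # 1.0
```

With PyMuPDF, such a factor would typically be passed as page.get_pixmap(matrix=pymupdf.Matrix(zoom, zoom)); recent versions also accept page.get_pixmap(dpi=144) directly.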

A Pixmap contains a number of methods and attributes which are referenced below. Among them are the integers width, height (each in pixels) and stride (number of bytes of one horizontal image line). Attribute samples represents a rectangular area of bytes representing the image data (a Python bytes object).
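To make the relationship between samples and stride concrete, here is a sketch that reads one pixel from a raw byte buffer. It uses a made-up 2x2 RGB image rather than a real Pixmap; n (the number of bytes per pixel, available on a Pixmap as pix.n) is not mentioned in the text above and is introduced here for the illustration.

```python
def pixel_at(samples, stride, n, x, y):
    """Return the n component values of pixel (x, y) as a tuple.

    Each image line occupies `stride` bytes, each pixel `n` bytes.
    """
    start = y * stride + x * n
    return tuple(samples[start:start + n])


# Made-up 2x2 RGB image: red, green / blue, white
width, height, n = 2, 2, 3
stride = width * n
samples = bytes([255, 0, 0, 0, 255, 0,
                 0, 0, 255, 255, 255, 255])

print(pixel_at(samples, stride, n, 1, 0))  # (0, 255, 0)
print(pixel_at(samples, stride, n, 0, 1))  # (0, 0, 255)
```

For a real Pixmap the call would be pixel_at(pix.samples, pix.stride, pix.n, x, y).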


### Note
You can also create a vector image of a page by using Page.get_svg_image(). Refer to this Vector Image Support page for details.

## Saving the Page Image in a File
We can simply store the image in a PNG file:

    pix.save('page-%i.png' % page.number)


To display the image in a Tkinter GUI, convert the Pixmap to a PhotoImage, for example via Pillow:

    from PIL import Image, ImageTk

    # set the mode depending on alpha
    mode = "RGBA" if pix.alpha else "RGB"
    img = Image.frombytes(mode, (pix.width, pix.height), pix.samples)
    tkimg = ImageTk.PhotoImage(img)

The following avoids using Pillow:

    import tkinter

    # remove alpha if present
    pix1 = pymupdf.Pixmap(pix, 0) if pix.alpha else pix  # PPM does not support transparency
    imgdata = pix1.tobytes("ppm")  # extremely fast!
    tkimg = tkinter.PhotoImage(data=imgdata)


# Extracting Text and Images
We can also extract all text, images and other information of a page in many different forms and levels of detail:

    text = page.get_text(opt)

Use one of the following strings for opt to obtain different formats [2]:

“text”: (default) plain text with line breaks. No formatting, no text position details, no images.

“blocks”: generate a list of text blocks (= paragraphs).

“words”: generate a list of words (strings not containing spaces).

“html”: creates a full visual version of the page including any images. This can be displayed with your internet browser.

“dict” / “json”: same information level as HTML, but provided as a Python dictionary or a JSON string, respectively. See TextPage.extractDICT() for details of its structure.

“rawdict” / “rawjson”: a super-set of “dict” / “json”. It additionally provides character detail information like XML. See TextPage.extractRAWDICT() for details of its structure.

“xhtml”: same text information level as the “text” version, but includes images. Can also be displayed by internet browsers.

“xml”: contains no images, but full position and font information down to each single text character. Use an XML module to interpret.
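As an illustration of the “dict” format, the sketch below walks the nested structure blocks, lines, spans and yields the text with its font and size. The function name iter_spans is hypothetical, and the sample dictionary is a heavily shortened, made-up stand-in for the result of page.get_text("dict").

```python
def iter_spans(page_dict):
    """Yield (text, font, size) for every text span in a "dict" result."""
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # 0 = text block, 1 = image block
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                yield span["text"], span["font"], span["size"]


# Made-up, shortened example of the "dict" structure
sample = {
    "width": 595,
    "height": 842,
    "blocks": [
        {"type": 0, "lines": [
            {"spans": [{"text": "Hello", "font": "Helvetica", "size": 11.0}]},
        ]},
        {"type": 1},  # image block, skipped
    ],
}

for text, font, size in iter_spans(sample):
    print(f"{text!r} in {font} at {size} pt")
```

With a real page: iter_spans(page.get_text("dict")).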


# Searching for Text
You can find out, exactly where on a page a certain text string appears:

    areas = page.search_for("mupdf")

This delivers a list of rectangles (see Rect), each of which surrounds one occurrence of the string “mupdf” (case insensitive). You could use this information to e.g. highlight those areas (PDF only) or create a cross reference of the document.
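The cross reference mentioned above could be built along these lines. This is a sketch only: cross_reference is a hypothetical helper, and doc is assumed to be an opened PyMuPDF Document (anything that iterates over pages offering search_for() and number works).

```python
def cross_reference(doc, needle):
    """Map page number -> list of hit rectangles for `needle`.

    Pages without any occurrence are left out of the result.
    """
    hits = {}
    for page in doc:
        areas = page.search_for(needle)
        if areas:
            hits[page.number] = areas
    return hits
```

For example, cross_reference(pymupdf.open('example.pdf'), "mupdf") would return a dictionary such as {0: [Rect(...), ...]} for a document where the string occurs on the first page.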

# PDF Maintenance
PDFs are the only document type that can be modified using PyMuPDF. Other file types are read-only.

However, you can convert any document (including images) to a PDF and then apply all PyMuPDF features to the conversion result.


Find out more in the documentation of Document.convert_to_pdf(), and also look at the demo script pdf-converter.py, which can convert any supported document to PDF.

Document.save() always stores a PDF in its current (potentially modified) state on disk.



# Modifying, Creating, Re-arranging and Deleting Pages

There are several ways to manipulate the so-called page tree (a structure describing all the pages) of a PDF:

Document.delete_page() and Document.delete_pages() delete pages.

Document.copy_page(), Document.fullcopy_page() and Document.move_page() copy or move a page to other locations within the same document.


Document.select() shrinks a PDF down to selected pages. Parameter is a sequence [3] of the page numbers that you want to keep. These integers must all be in range 0 <= i < page_count. When executed, all pages missing in this list will be deleted. Remaining pages will occur in the sequence and as many times (!) as you specify them.

So you can easily create new PDFs with

the first or last 10 pages,

only the odd or only the even pages (for doing double-sided printing),

pages that do or don’t contain a given text,

reverse the page sequence, …

… whatever you can think of.

The saved new document will contain links, annotations and bookmarks that are still valid (in other words, either pointing to a selected page or to some external resource).
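The page lists for some of the selections named above can be computed with plain Python before handing them to Document.select(). The helper names below are hypothetical; all page numbers are 0-based, so the "odd" pages of a printout (1st, 3rd, ...) have the indices 0, 2, ...

```python
def odd_pages(page_count):
    """Indices of the 1st, 3rd, 5th, ... page (0-based: 0, 2, 4, ...)."""
    return list(range(0, page_count, 2))


def reversed_pages(page_count):
    """All page indices in reverse order."""
    return list(range(page_count - 1, -1, -1))


def first_and_last(page_count, n):
    """The first n and the last n page indices, without duplicates."""
    head = list(range(min(n, page_count)))
    tail = [i for i in range(max(page_count - n, 0), page_count) if i not in head]
    return head + tail


print(odd_pages(5))           # [0, 2, 4]
print(reversed_pages(3))      # [2, 1, 0]
print(first_and_last(25, 2))  # [0, 1, 23, 24]
```

With an opened PDF named doc, a call like doc.select(reversed_pages(doc.page_count)) followed by doc.save("reversed.pdf") would then write the reordered file.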

Document.insert_page() and Document.new_page() insert new pages.

Pages themselves can moreover be modified by a range of methods (e.g. page rotation, annotation and link maintenance, text and image insertion).

# Joining and Splitting PDF Documents
Method Document.insert_pdf() copies pages between different PDF documents. Here is a simple joiner example (doc1 and doc2 being opened PDFs):

    # append complete doc2 to the end of doc1
    doc1.insert_pdf(doc2)

Here is a snippet that splits doc1. It creates a new document of its first and its last 10 pages:

    doc2 = pymupdf.open()                 # new empty PDF
    doc2.insert_pdf(doc1, to_page = 9)  # first 10 pages
    doc2.insert_pdf(doc1, from_page = len(doc1) - 10) # last 10 pages
    doc2.save("first-and-last-10.pdf")
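The same from_page / to_page parameters can be used to split a document into parts of equal size. The sketch below only computes the page ranges; chunk_ranges is a hypothetical helper, and both ends of each range are inclusive, matching the to_page = 9 example above.

```python
def chunk_ranges(page_count, chunk_size):
    """Yield (from_page, to_page) pairs, both inclusive, covering all pages."""
    for start in range(0, page_count, chunk_size):
        yield start, min(start + chunk_size, page_count) - 1


print(list(chunk_ranges(25, 10)))  # [(0, 9), (10, 19), (20, 24)]
```

For each pair (a, b), a new part could then be created with part = pymupdf.open(), filled with part.insert_pdf(doc1, from_page=a, to_page=b), and saved under its own file name.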


# Saving
As mentioned above, Document.save() will always save the document in its current state.

You can write changes back to the original PDF by specifying option incremental=True. This process is (usually) extremely fast, since changes are appended to the original file without completely rewriting it.
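A cautious way to use incremental saving is sketched below. The function save_changes and the fallback_path parameter are hypothetical; doc is assumed to be a PyMuPDF Document, whose can_save_incrementally() and saveIncr() methods are used as understood from the PyMuPDF documentation.

```python
def save_changes(doc, fallback_path):
    """Append changes to the original file if possible, else write a full copy.

    Returns the path that was written.
    """
    if doc.can_save_incrementally():
        doc.saveIncr()  # incremental save back to doc.name
        return doc.name
    doc.save(fallback_path)  # full rewrite under a different name
    return fallback_path
```

Incremental saving is not always possible (for example for a document that was not opened from a file), which is why the sketch falls back to a separate output file.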

Document.save() options correspond to options of MuPDF’s command line utility mutool clean, see the following table.


# Closing
It is often desirable to “close” a document to relinquish control of the underlying file to the OS, while your program continues.

This can be achieved by the Document.close() method. Apart from closing the underlying file, buffer areas associated with the document will be freed.