# PyMuPDF: A Powerful PDF Toolkit
This IPython Notebook provides a practical demonstration of how to use the PyMuPDF library for PDF manipulation and information extraction, with a focus on its potential applications in Retrieval Augmented Generation (RAG) systems. Here's a breakdown of its content:

## Document Level Operations

* Demonstrates how to open a PDF document using `pymupdf.open()`.
* Shows how to retrieve document-level information like:
    * Document name (`doc.name`)
    * Page count (`doc.page_count`)
    * Table of contents (`doc.get_toc()`)
    * Metadata (`doc.metadata`)
    * Encryption status (`doc.is_encrypted`)
    * Password protection status (`doc.needs_pass`)
    * Permissions (`doc.permissions`)
    * Document outline (`doc.outline`)
    * Chapter count (`doc.chapter_count`)
    * Page layout (`doc.pagelayout`)
    * PDF verification (`doc.is_pdf`)

## Page Level Operations

* Provides an overview of various page-level methods:
    * `page.get_text()`: For extracting text content.
    * `page.get_images()`: For extracting images.
    * `page.search_for()`: For finding text occurrences.
    * `page.find_tables()`: For detecting and extracting tables.
    * `page.get_links()`: For extracting hyperlinks.
    * `page.annots()`: For retrieving annotations.
* Demonstrates how to extract text from a specific page.
* Shows how to extract images from a page and from the entire PDF.
* Includes code to extract drawings, although it's not executed in the provided notebook.
* Demonstrates table extraction.

## Text Extraction Methods

* Explores various methods for extracting text from a page:
    * `get_textpage()`: Gets the text page object.
    * `extractText()`: Extracts the page's text.
    * `extractBLOCKS()`: Extracts text as blocks.
    * `extractWORDS()`: Extracts words.
    * `extractHTML()`: Extracts text as HTML.
    * `extractDICT()`: Extracts text as a dictionary.
    * `extractRAWDICT()`: Extracts text as a raw dictionary.
    * `extractJSON()`: Extracts text as JSON.
    * `extractRAWJSON()`: Extracts text as raw JSON.
    * `rect`: Gets the bounding box coordinates for the entire page.
* Shows how to prettify JSON output.
* Demonstrates how to identify all tables in a PDF and convert extracted table data into a pandas DataFrame.

## PyMuPDF4LLM Integration

* Uses `pymupdf4llm.to_markdown()` with `page_chunks = True` to convert the PDF document to Markdown, one chunk per page.
* Shows how each chunk's metadata and text are combined into records and saved as a JSON Lines file for RAG chunking.

---
## Colab Setup

To run this notebook in Colab, run the cells in this section. Otherwise, skip this section.

This cell will authenticate to GCP (follow prompts in the popup).

In [1]:
PROJECT_ID = 'vertexai-demo-ltfpzhaw' # replace with your project ID

In [2]:
try:
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
    print('Colab authorized to GCP')
except Exception:
    print('Not a Colab Environment')

Not a Colab Environment


---
## Installs

The list `packages` contains tuples of package import names and install names, optionally followed by a minimum version. If the import name is not found, the install name is used to install the package quietly for the current user.

In [3]:
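# tuples of (import name, install name[, min_version])
packages = [
    ('google.cloud.aiplatform', 'google-cloud-aiplatform'),
    ('pymupdf', 'pymupdf'),
    ('pymupdf4llm', 'pymupdf4llm')
]

import importlib.util, importlib.metadata, re

def is_importable(name):
    # find_spec raises ModuleNotFoundError when a parent package (e.g. `google`) is missing
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        return False

def version_key(version):
    # compare numeric components as integers so that '1.10' sorts after '1.9'
    return tuple(int(part) for part in re.findall(r'\d+', version))

install = False
for package in packages:
    if not is_importable(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
    elif len(package) == 3:
        # importlib.metadata expects the distribution (install) name, not the import name
        if version_key(importlib.metadata.version(package[1])) < version_key(package[2]):
            print(f'updating package {package[1]}')
            install = True
            !pip install {package[1]} -U -q --user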

### Restart Kernel (If Installs Occurred)

After the kernel restarts, continue from the cell that follows this one.

In [4]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)
    IPython.display.display(IPython.display.Markdown("""<div class=\"alert alert-block alert-warning\">
        <b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. The previous cells do not need to be run again⚠️</b>
        </div>"""))

---
## Setup

inputs:

In [5]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'vertexai-demo-ltfpzhaw'

In [None]:
REGION = 'us-central1'
SERIES = 'applied-genai'
EXPERIMENT = 'pymupdf'

packages:

In [None]:
# Standard library and common third-party packages
import os, time, io, re, json, requests, pandas as pd

# Google Cloud
import google.cloud.aiplatform as aiplatform

# Vertex AI
import vertexai

# PDF Processing (PyMuPDF; `fitz` is its legacy import name)
import fitz
import pymupdf4llm

# IPython Display
from IPython.display import display, Image, HTML, Markdown

In [8]:
aiplatform.__version__

'1.75.0'

In [None]:
vertexai.init(project = PROJECT_ID, location = REGION)

Parameters

In [10]:
DIR = f"files/{EXPERIMENT}"

Environment

In [11]:
if not os.path.exists(DIR):
    os.makedirs(DIR)

---
## Documents

Retrieve the document and load it into memory for processing.

### Retrieve Documents

In [12]:
freddie_url = 'https://guide.freddiemac.com/ci/okcsFattach/get/1002095_2'

In [13]:
freddie_retrieve = requests.get(freddie_url).content

In [14]:
freddie_doc = fitz.open(stream = freddie_retrieve, filetype = 'pdf')

In [15]:
freddie_doc.page_count

2450

---
## Document Level

**open** loads the PDF document and returns a `Document` object, the entry point to the rest of PyMuPDF's capabilities.

#### Get Document Name

In [17]:
doc = freddie_doc

In [18]:
# the name attribute holds the file name only when the document is opened from a local path; this document was opened from a stream, so it is empty
doc.name

''

#### Get Page Count

Extract the page count of the PDF using **page_count**.

In [19]:
doc.page_count

2450

#### Get Table of Contents

**get_toc** returns the table of contents as a list of `[level, title, page]` entries.

In [None]:
doc.get_toc()
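With its default arguments, `get_toc()` returns entries shaped as `[level, title, page]`, which makes it easy to print as an indented outline. The sketch below uses a hypothetical `sample_toc` in that shape rather than the real document's entries; `format_toc` is an illustrative helper, not part of PyMuPDF.

```python
# Hypothetical entries in the [level, title, page] shape returned by doc.get_toc()
sample_toc = [
    [1, "Part A", 1],
    [2, "Section A.1", 3],
    [2, "Section A.2", 12],
    [1, "Part B", 40],
]

def format_toc(toc):
    """Return one line per entry, indented two spaces per hierarchy level."""
    return [f"{'  ' * (level - 1)}{title} (p. {page})" for level, title, page in toc]

for line in format_toc(sample_toc):
    print(line)
```

For the notebook's document the same helper would be called as `format_toc(doc.get_toc())`.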

#### Get document metadata

**metadata** returns a dictionary containing metadata like author, creation date, keywords, etc.

In [None]:
doc.metadata

#### Get encryption status

In [None]:
doc.is_encrypted

#### Get Password Protection Status

In [None]:
doc.needs_pass

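#### Get Permissions

**permissions** returns an integer whose bits encode what a user may do with the document (print, copy, annotate, and so on), not a list.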

In [None]:
doc.permissions
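Because the permissions value is a bit field, it is easier to read once decoded. The sketch below hard-codes the bit values defined by the PDF specification, which PyMuPDF also exposes as `pymupdf.PDF_PERM_*` constants; `PERMISSION_BITS` and `decode_permissions` are illustrative names, and the sample value is hypothetical.

```python
# Permission bits as defined by the PDF specification; PyMuPDF exposes the
# same values as pymupdf.PDF_PERM_* constants.
PERMISSION_BITS = {
    "print": 4,
    "modify": 8,
    "copy": 16,
    "annotate": 32,
    "fill_forms": 256,
    "accessibility": 512,
    "assemble": 1024,
    "print_high_quality": 2048,
}

def decode_permissions(perm):
    """Map the integer from doc.permissions to a dict of booleans."""
    return {name: bool(perm & bit) for name, bit in PERMISSION_BITS.items()}

# Hypothetical value: only printing and copying are allowed.
decoded = decode_permissions(4 | 16)
print(decoded)
```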

#### Get document outline

In [None]:
doc.outline

#### Get Chapter Count

In [None]:
doc.chapter_count

#### Check Whether the Document Is Reflowable

In [None]:
doc.is_reflowable

#### Get Page Layout

In [None]:
doc.pagelayout

#### Verify Document Type (PDF only)

In [None]:
doc.is_pdf

## Page Level

 - page.get_text(): Extract text content in the output format chosen by its first argument: "text" (the default), "blocks", "words", "html", "dict", "json", "rawdict", "rawjson", "xhtml", or "xml".
 - page.get_images(): Get a list of the images referenced by the page. Each entry is a tuple whose first item is the image's xref, which doc.extract_image() uses to retrieve the image data and properties.
 - page.search_for(): Find text occurrences and their bounding boxes on the page.
 - page.find_tables(): Detect tables on the page. Call table.extract() on each detected table to get its cell contents.
 - page.get_links(): Extract hyperlinks within the page.
 - page.annots(): Iterate over annotations such as highlights, underlines, and text notes.

#### Extract Text from a specific page

In [None]:
doc[0].get_text()

#### Extract Images from a specific page

In [None]:
doc[0].get_images()

#### Extract all the images from the PDF

In [None]:
for pg in range(doc.page_count):
    for img in doc[pg].get_images(full=True):
        print(f'Page Number: {pg+1}')
        display(Image(data = doc.extract_image(img[0])["image"]))

## Text Extraction Methods

get_textpage() creates a TextPage object for the page; the extract methods below read from it.

In [None]:
doc[0].get_textpage()

#### Extracting Text

Extracts the text content of a page as a single, continuous string.

In [None]:
doc[0].get_textpage().extractText()

#### Extracting Texts as BLOCKS

- Extracts text as a list of text blocks
- Each block represents a distinct text area on the page (e.g., a paragraph, a heading).
- Provides more structure than extractText(), including coordinates and block types.

In [None]:
doc[0].get_textpage().extractBLOCKS()

#### Extracting Words

- Extracts text as a list of individual words.
- Also provides coordinate information for each word.
- Useful for tasks that require word-level analysis.

In [None]:
doc[0].get_textpage().extractWORDS()
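Each word comes back as a tuple `(x0, y0, x1, y1, word, block_no, line_no, word_no)`, so lines can be rebuilt by grouping on the block and line numbers. The sketch below uses hypothetical tuples in that shape instead of a real page; `words_to_lines` is an illustrative helper.

```python
from collections import defaultdict

# Hypothetical tuples in the (x0, y0, x1, y1, word, block_no, line_no, word_no) shape
sample_words = [
    (104.0, 72.0, 140.0, 84.0, "world", 0, 0, 1),
    (72.0, 72.0, 100.0, 84.0, "Hello", 0, 0, 0),
    (72.0, 90.0, 110.0, 102.0, "second", 0, 1, 0),
    (114.0, 90.0, 136.0, 102.0, "line", 0, 1, 1),
]

def words_to_lines(words):
    """Group words by (block_no, line_no) and join them in word_no order."""
    grouped = defaultdict(list)
    for _x0, _y0, _x1, _y1, text, block_no, line_no, word_no in words:
        grouped[(block_no, line_no)].append((word_no, text))
    return [
        " ".join(text for _, text in sorted(items))
        for _, items in sorted(grouped.items())
    ]

print(words_to_lines(sample_words))
```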

#### Extracting the Text as HTML

- Extracts the text content of a page and formats it as HTML.
- Preserves layout information such as text position, font, and size using HTML tags and inline styles.
- Useful for displaying PDF text in a web browser or further processing with HTML tools.

In [None]:
doc[0].get_textpage().extractHTML()

display(HTML(doc[0].get_textpage().extractHTML()))

#### Extracting Text as DICT

- Extracts text as a Python dictionary.
- Provides a structured representation of the page content, including blocks, lines, and spans of text.
- Offers detailed layout information.

In [None]:
doc[0].get_textpage().extractDICT()
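The dictionary nests blocks, then lines, then spans, and each span carries its text together with font details. The following sketch walks a hypothetical, heavily trimmed dictionary in that shape; the real output contains more keys per level, and `iter_spans` is an illustrative helper.

```python
# Hypothetical, trimmed-down dictionary in the blocks -> lines -> spans shape
sample_page = {
    "width": 612.0,
    "height": 792.0,
    "blocks": [
        {
            "type": 0,  # text block
            "lines": [
                {"spans": [{"text": "Title", "font": "Helvetica-Bold", "size": 14.0}]},
                {"spans": [{"text": "Body text", "font": "Helvetica", "size": 10.0}]},
            ],
        },
        {"type": 1},  # image block: has no "lines"
    ],
}

def iter_spans(page_dict):
    """Yield (text, font, size) for every span in the text blocks."""
    for block in page_dict["blocks"]:
        if block["type"] != 0:   # skip image blocks
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                yield span["text"], span["font"], span["size"]

spans = list(iter_spans(sample_page))
print(spans)
```

Span-level font size is one common heuristic for spotting headings when chunking a document.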

#### Extracting Text as RAWDICT

- Similar to extractDICT(), but each span carries a list of individual characters, each with its own origin and bounding box, instead of a single text string.
- Produces considerably larger output; useful when character-level positions are needed.

In [None]:
doc[0].get_textpage().extractRAWDICT()

#### Extracting Text as JSON

- Extracts text in JSON format (a string representing a JSON object).
- A serialized version of the data returned by extractDICT().
- Useful for data exchange and integration with other systems.

In [None]:
doc[0].get_textpage().extractJSON()

Prettifying JSON

In [None]:
print(json.dumps(json.loads(doc[0].get_textpage().extractJSON()), indent = 2))

#### Extracting Text as RAWJSON

- Extracts text as JSON, based on the data from extractRAWDICT().

In [None]:
doc[0].get_textpage().extractRAWJSON()

Prettifying JSON

In [None]:
print(json.dumps(json.loads(doc[0].get_textpage().extractRAWJSON()), indent = 2))

#### Extracting bounding box coordinates for the entire page using rect

- Returns the rectangle covered by the text page, which is the full page rectangle unless a clip area was passed to get_textpage().
- Useful for knowing the page dimensions.

In [None]:
doc[0].get_textpage().rect

#### Identify all the tables

In [None]:
for page_num, page in enumerate(doc):
    for table in page.find_tables().tables:
        print(f"Table found on page {page_num + 1}:")
        print(f"  - Bounding box: {table.bbox}")

        # Rows of cell text; the first row is not promoted to a header here
        table_data = table.extract()

        # Create and display a pandas DataFrame
        df = pd.DataFrame(table_data)
        display(df)
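`table.extract()` returns a list of rows, each a list of cell strings (or `None` for empty cells), so the first row often needs to be promoted to a header. The sketch below does this in plain Python on hypothetical rows; `rows_to_records` is an illustrative helper.

```python
# Hypothetical output of table.extract(): rows of cell strings, None for empty cells
sample_rows = [
    ["Name", "Value"],
    ["alpha", "1"],
    ["beta", None],
]

def rows_to_records(rows):
    """Treat the first row as the header and return one dict per remaining row."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]

records = rows_to_records(sample_rows)
print(records)
```

The resulting records can be passed to `pd.DataFrame(records)`. PyMuPDF tables also provide a `to_pandas()` method that builds a DataFrame directly.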

#### Converting PDF to Markdown for RAG-Related Chunking

In [None]:
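# page_chunks = True returns a list with one dictionary per page instead of a single Markdown string
md_text = pymupdf4llm.to_markdown(doc = doc, page_chunks = True)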

In [None]:
def process_and_save_chunks(md_text, output_file_path):
    """
    Processes the md_text list of dictionaries and saves all chunks into a single JSON Lines file.

    Args:
        md_text: A list of dictionaries, where each dictionary represents a chunk.
        output_file_path: The path to the output JSON Lines file.
    """
    all_chunks = []  # List to store all chunk dictionaries

    for pg_num, chunk_data in enumerate(md_text):
        gse = 'freddie'
        file_name = str(chunk_data['metadata']['file_path'])
        file_chunk_id = str(chunk_data['metadata']['page'])
        chunk_id = f'freddie_part_{pg_num + 1}'
        content = chunk_data['text']

        chunk_json = {
            'gse': gse,
            'filename': file_name,
            'file_chunk_id': file_chunk_id,
            'chunk_id': chunk_id,
            'content': content,
        }
        all_chunks.append(chunk_json)

    # Write all chunks to a single JSON Lines file
    with open(output_file_path, 'w') as f:
        for chunk in all_chunks:
            f.write(json.dumps(chunk) + '\n')

In [None]:
chunk_json = {}

In [None]:
process_and_save_chunks(md_text, output_file_path = f'{DIR}/document-chunks.jsonl' )
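A JSON Lines file holds one JSON object per line, so it can be read back line by line to check what was written. The sketch below round-trips a hypothetical chunk with the same keys as the records above through a temporary file; `read_jsonl` and the sample values are illustrative.

```python
import json
import os
import tempfile

def read_jsonl(path):
    """Read a JSON Lines file into a list of dictionaries, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical chunk with the same keys the notebook writes
sample_chunk = {
    "gse": "freddie",
    "filename": "",
    "file_chunk_id": "1",
    "chunk_id": "freddie_part_1",
    "content": "# Heading\n\nBody text",
}

path = os.path.join(tempfile.mkdtemp(), "document-chunks.jsonl")
with open(path, "w", encoding="utf-8") as f:
    f.write(json.dumps(sample_chunk) + "\n")

chunks = read_jsonl(path)
print(len(chunks), chunks[0]["chunk_id"])
```

Newlines inside `content` are escaped by `json.dumps`, which is what keeps each record on a single line.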

#### Text Extraction Flags

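Text extraction flags control details such as whether ligatures, whitespace, and images are preserved. They are passed to the extraction methods through the `flags` argument and are listed here:

https://pymupdf.readthedocs.io/en/latest/vars.html#constants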