In [1]:
# Import the pymupdf4llm library, which provides functionality for extracting text and other elements from PDF files.
import pymupdf4llm

### Get all text in Markdown

In [2]:
# 1. Extract all text from the "transformers.pdf" document and convert it to Markdown format.
md_text = pymupdf4llm.to_markdown("transformers.pdf")

Processing transformers.pdf...


In [3]:
print(md_text)

#### Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.

## Attention Is All You Need


**Ashish Vaswani[∗]**
Google Brain
```
avaswani@google.com

```
**Llion Jones[∗]**
Google Research
```
 llion@google.com

```

**Noam Shazeer[∗]**
Google Brain
```
noam@google.com

```

**Aidan N. Gomez[∗†]**
University of Toronto
```
aidan@cs.toronto.edu

```

**Niki Parmar[∗]**
Google Research
```
nikip@google.com

```

**Jakob Uszkoreit[∗]**
Google Research
```
usz@google.com

```

**Łukasz Kaiser[∗]**
Google Brain
```
lukaszkaiser@google.com

```

**Illia Polosukhin[∗‡]**
```
illia.polosukhin@gmail.com

#### Abstract

```

The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a ne

### Save Markdown to a file

In [4]:
# 2. Save the extracted Markdown text to a file named "output.md".
import pathlib

In [5]:
pathlib.Path("output.md").write_bytes(md_text.encode())

41196
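The number shown above is the return value of `write_bytes`, which is the count of bytes written, not the count of characters. A minimal sketch of the difference, using a short made-up sample string and a temporary folder rather than the real `md_text`:

```python
import pathlib
import tempfile

# Sample text only; the real md_text comes from pymupdf4llm.to_markdown().
sample_md = "## Attention Is All You Need\n\n**Łukasz Kaiser[∗]**\n"

with tempfile.TemporaryDirectory() as tmp:
    out_path = pathlib.Path(tmp) / "output.md"
    # write_bytes returns the number of bytes written to the file.
    n_bytes = out_path.write_bytes(sample_md.encode("utf-8"))
    # Read the file back with an explicit encoding to check the round trip.
    round_trip = out_path.read_text(encoding="utf-8")

n_chars = len(sample_md)
print(n_bytes, n_chars)  # byte count exceeds character count for non-ASCII text
```

Because characters such as `Ł` and `∗` take more than one byte in UTF-8, the byte count is larger than `len(md_text)` whenever the text is not pure ASCII.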

### Extract only specific pages

In [6]:
# 3. Extract text only from specific pages of "transformers.pdf" in Markdown format.
# Page numbers are 0-based, so [0, 1] selects the first two pages.
md_text_pages = pymupdf4llm.to_markdown("transformers.pdf", pages=[0, 1])

Processing transformers.pdf...


In [7]:
print(md_text_pages)

#### Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.

## Attention Is All You Need


**Ashish Vaswani[∗]**
Google Brain
```
avaswani@google.com

```
**Llion Jones[∗]**
Google Research
```
 llion@google.com

```

**Noam Shazeer[∗]**
Google Brain
```
noam@google.com

```

**Aidan N. Gomez[∗†]**
University of Toronto
```
aidan@cs.toronto.edu

```

**Niki Parmar[∗]**
Google Research
```
nikip@google.com

```

**Jakob Uszkoreit[∗]**
Google Research
```
usz@google.com

```

**Łukasz Kaiser[∗]**
Google Brain
```
lukaszkaiser@google.com

```

**Illia Polosukhin[∗‡]**
```
illia.polosukhin@gmail.com

#### Abstract

```

The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a ne
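Since `pages` takes 0-based indices while people usually quote 1-based page numbers, a small conversion helper can reduce off-by-one mistakes. The function below is a hypothetical helper, not part of pymupdf4llm; it is a sketch that assumes a spec string such as `"1-2, 5"`.

```python
def to_zero_based(page_spec: str) -> list:
    """Turn a 1-based spec like "1-2, 5" into a 0-based page list."""
    pages = []
    for part in page_spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = (int(p) for p in part.split("-", 1))
        else:
            first = last = int(part)
        if first < 1 or last < first:
            raise ValueError(f"invalid page range: {part!r}")
        # Shift to 0-based; range() excludes its end, so `last` stays as is.
        pages.extend(range(first - 1, last))
    return pages


page_list = to_zero_based("1-2, 5")
print(page_list)  # [0, 1, 4]
```

The resulting list could then be passed as `pages=page_list` to `to_markdown`.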

### Extract document for LlamaIndex

In [8]:
# 4. Prepare the document for use with LlamaIndex by creating a reader that loads PDF data as Markdown.
llama_reader = pymupdf4llm.LlamaMarkdownReader()

Successfully imported LlamaIndex


In [9]:
llama_doc = llama_reader.load_data("transformers.pdf")

Processing transformers.pdf...


In [10]:
# Count the loaded documents. The result matches the PDF's 15 pages, i.e. one Markdown document per page.
len(llama_doc)

15

In [11]:
# Print the first 100 characters of the first Markdown text block to verify the content.
llama_doc[0].text[:100]

'#### Provided proper attribution is provided, Google hereby grants permission to reproduce the table'
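To check more than the first block, the same slicing can be applied to every loaded document. The sketch below does not import LlamaIndex; it uses `SimpleNamespace` stand-ins and assumes only that each item exposes a `.text` attribute, as `llama_doc[0].text` does above. The sample strings are placeholders.

```python
from types import SimpleNamespace

# Stand-ins for the objects returned by load_data(); only .text is assumed.
docs = [
    SimpleNamespace(text="#### Provided proper attribution is provided, Google hereby grants"),
    SimpleNamespace(text="placeholder text for a second page"),
]


def summarize(docs, width=40):
    """Return (index, character count, preview) for each document."""
    return [(i, len(d.text), d.text[:width]) for i, d in enumerate(docs)]


summary = summarize(docs)
for index, n_chars, preview in summary:
    print(index, n_chars, repr(preview))
```

With the real `llama_doc` list, `summarize(llama_doc)` would give a quick overview of all loaded blocks.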

### Extract images

In [12]:
# 5. Extract images along with the Markdown text, saving images separately. 
# Set the `page_chunks=True` option to get Markdown per page, and save images as PNG files at 300 DPI in the "images" folder.
md_text_images = pymupdf4llm.to_markdown(
    doc="transformers.pdf",
    page_chunks=True,
    write_images=True,
    image_path="images",
    image_format="png",
    dpi=300
)

Processing transformers.pdf...


In [13]:
print(md_text_images)

[{'metadata': {'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'LaTeX with hyperref', 'producer': 'pdfTeX-1.40.25', 'creationDate': 'D:20240410211143Z', 'modDate': 'D:20240410211143Z', 'trapped': '', 'encryption': None, 'file_path': 'transformers.pdf', 'page_count': 15, 'page': 1}, 'toc_items': [], 'tables': [], 'images': [], 'graphics': [], 'text': '#### Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.\n\n## Attention Is All You Need\n\n\n**Ashish Vaswani[∗]**\nGoogle Brain\n```\navaswani@google.com\n\n```\n**Llion Jones[∗]**\nGoogle Research\n```\n llion@google.com\n\n```\n\n**Noam Shazeer[∗]**\nGoogle Brain\n```\nnoam@google.com\n\n```\n\n**Aidan N. Gomez[∗†]**\nUniversity of Toronto\n```\naidan@cs.toronto.edu\n\n```\n\n**Niki Parmar[∗]**\nGoogle Research\n```\nnikip@google.com\n\n```\n\n**Jakob Uszkoreit[∗]**\nGoogle 
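After a run with `write_images=True`, it can help to check what ended up in the image folder. The notebook does not show whether that folder has to exist beforehand, so the hypothetical helper below creates it as a precaution and then lists the PNG files. It is a sketch that runs against a temporary folder with dummy files whose names are invented.

```python
import pathlib
import tempfile


def list_images(folder, suffix=".png"):
    """Create the folder if needed and return the sorted image file names."""
    folder = pathlib.Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    return sorted(p.name for p in folder.iterdir() if p.suffix.lower() == suffix)


with tempfile.TemporaryDirectory() as tmp:
    image_dir = pathlib.Path(tmp) / "images"
    before = list_images(image_dir)  # folder is created here, still empty
    # Dummy files standing in for whatever the extraction would write.
    (image_dir / "example-a.png").write_bytes(b"")
    (image_dir / "example-b.png").write_bytes(b"")
    (image_dir / "notes.txt").write_text("not an image")
    after = list_images(image_dir)

print(before, after)
```

In the notebook itself, `list_images("images")` would play the same role after the `to_markdown` call.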

### Chunk data and extract with metadata

In [14]:
# 6. Extract the Markdown text with metadata for each page as separate chunks.
md_text_chunks = pymupdf4llm.to_markdown(
    doc="transformers.pdf",
    page_chunks=True
)

Processing transformers.pdf...


In [15]:
print(md_text_chunks)

[{'metadata': {'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'LaTeX with hyperref', 'producer': 'pdfTeX-1.40.25', 'creationDate': 'D:20240410211143Z', 'modDate': 'D:20240410211143Z', 'trapped': '', 'encryption': None, 'file_path': 'transformers.pdf', 'page_count': 15, 'page': 1}, 'toc_items': [], 'tables': [], 'images': [], 'graphics': [], 'text': '#### Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.\n\n## Attention Is All You Need\n\n\n**Ashish Vaswani[∗]**\nGoogle Brain\n```\navaswani@google.com\n\n```\n**Llion Jones[∗]**\nGoogle Research\n```\n llion@google.com\n\n```\n\n**Noam Shazeer[∗]**\nGoogle Brain\n```\nnoam@google.com\n\n```\n\n**Aidan N. Gomez[∗†]**\nUniversity of Toronto\n```\naidan@cs.toronto.edu\n\n```\n\n**Niki Parmar[∗]**\nGoogle Research\n```\nnikip@google.com\n\n```\n\n**Jakob Uszkoreit[∗]**\nGoogle 
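The printed list is hard to read in raw form. Each page chunk is a dictionary with the keys visible above (`metadata`, `toc_items`, `tables`, `images`, `graphics`, `text`), and `metadata["page"]` is 1 for the first page, so it appears to be 1-based. The sketch below flattens such chunks into compact records. The sample data mirrors those keys, with shortened placeholder text, and `to_records` is a hypothetical helper.

```python
# Shortened sample mirroring the keys in the printed output above.
sample_chunks = [
    {"metadata": {"file_path": "transformers.pdf", "page_count": 15, "page": 1},
     "toc_items": [], "tables": [], "images": [], "graphics": [],
     "text": "## Attention Is All You Need\n\n#### Abstract\n"},
    {"metadata": {"file_path": "transformers.pdf", "page_count": 15, "page": 2},
     "toc_items": [], "tables": [], "images": [], "graphics": [],
     "text": "placeholder text for page 2\n"},
]


def to_records(chunks):
    """Flatten page chunks into small records keyed by file and page."""
    records = []
    for chunk in chunks:
        meta = chunk["metadata"]
        records.append({
            "id": f'{meta["file_path"]}#page={meta["page"]}',
            "page": meta["page"],
            "n_tables": len(chunk["tables"]),
            "n_images": len(chunk["images"]),
            "text": chunk["text"],
        })
    return records


records = to_records(sample_chunks)
for rec in records:
    print(rec["id"], len(rec["text"]))
```

Records of this shape are convenient when each page is later embedded or indexed separately, since the page number travels with the text.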

### Detailed word-by-word extraction

In [16]:
# 7. Extract text with word-level position data (extract_words=True). Also save images as PNG files at 300 DPI in the "image" folder.
md_text_words = pymupdf4llm.to_markdown(
    doc="transformers.pdf",
    page_chunks=True,
    write_images=True,
    image_path="image",
    image_format="png",
    dpi=300,
    extract_words=True
)

Processing transformers.pdf...


In [17]:
print(md_text_words[0])

{'metadata': {'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'LaTeX with hyperref', 'producer': 'pdfTeX-1.40.25', 'creationDate': 'D:20240410211143Z', 'modDate': 'D:20240410211143Z', 'trapped': '', 'encryption': None, 'file_path': 'transformers.pdf', 'page_count': 15, 'page': 1}, 'toc_items': [], 'tables': [], 'images': [], 'graphics': [], 'text': '#### Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.\n\n## Attention Is All You Need\n\n\n**Ashish Vaswani[∗]**\nGoogle Brain\n```\navaswani@google.com\n\n```\n**Llion Jones[∗]**\nGoogle Research\n```\n llion@google.com\n\n```\n\n**Noam Shazeer[∗]**\nGoogle Brain\n```\nnoam@google.com\n\n```\n\n**Aidan N. Gomez[∗†]**\nUniversity of Toronto\n```\naidan@cs.toronto.edu\n\n```\n\n**Niki Parmar[∗]**\nGoogle Research\n```\nnikip@google.com\n\n```\n\n**Jakob Uszkoreit[∗]**\nGoogle R
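The output above is cut off before any word data appears. As a sketch, assume each page dictionary gains a `words` list whose entries follow the tuple layout of PyMuPDF's `get_text("words")`: `(x0, y0, x1, y1, word, block_no, line_no, word_no)`. Under that assumption, words can be selected by position. The coordinates below are invented sample values, and `words_in_rect` is a hypothetical helper.

```python
# Invented sample entries in the assumed (x0, y0, x1, y1, word, block, line, word_no) layout.
sample_words = [
    (108.0, 72.0, 170.0, 86.0, "Attention", 0, 0, 0),
    (174.0, 72.0, 186.0, 86.0, "Is", 0, 0, 1),
    (190.0, 72.0, 210.0, 86.0, "All", 0, 0, 2),
    (108.0, 300.0, 150.0, 312.0, "Abstract", 1, 0, 0),
]


def words_in_rect(words, rect):
    """Return the words whose bounding boxes lie fully inside rect."""
    rx0, ry0, rx1, ry1 = rect
    return [
        w[4]
        for w in words
        if w[0] >= rx0 and w[1] >= ry0 and w[2] <= rx1 and w[3] <= ry1
    ]


title_words = words_in_rect(sample_words, (100, 60, 220, 100))
print(title_words)  # ['Attention', 'Is', 'All']
```

With real data, the first argument would be something like `md_text_words[0]["words"]`, provided that key is present in the page dictionary.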

### Extract tables

In [18]:
# 8. Extract tables (if any) from a different document "wordpress-pdf-invoice-plugin-sample.pdf" and convert to Markdown.
md_text_tables = pymupdf4llm.to_markdown(
    doc="wordpress-pdf-invoice-plugin-sample.pdf"
)

Processing wordpress-pdf-invoice-plugin-sample.pdf...


In [19]:
md_text_tables

'## Invoice\n\n\n**From:**\n[DEMO - Sliced Invoices](http://slicedinvoices.com/demo)\nSuite 5A-1204\n123 Somewhere Street\nYour City AZ 12345\nadmin@slicedinvoices.com\n\n**To:**\nTest Business\n123 Somewhere St\nMelbourne, VIC 3000\ntest@test.com\n\n|Invoice Number|INV-3337|\n|---|---|\n|Order Number|12345|\n|Invoice Date|January 25, 2016|\n|Due Date|January 31, 2016|\n|Total Due|$93.50|\n\n|Hrs/Qty|Service|d Rate/Price|Adjust|Sub Total|\n|---|---|---|---|---|\n|1.00|Web Design This is a sample description...|i $85.00|0.00%|$85.00|\n\n|Rate/Price A $85.00 0 aid|Adjust Sub Total 0.00% $85.00|\n|---|---|\n|Sub Total|$85.00|\n|Tax|$8.50|\n|Total|$93.50|\n\n\nANZ Bank\nACC # 1234 1234\nBSB # 4321 432\n\n\nPayment is due within 30 days from date of invoice. Late payment is subject to fees of 5% per month.\n[Thanks for choosing DEMO - Sliced Invoices | admin@slicedinvoices com](http://slicedinvoices.com/demo)\n\n\n-----\n\n'
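The tables arrive as Markdown pipe tables embedded in the string. To work with the cell values, the pipe rows can be split back into lists. The parser below is a minimal sketch, not a full Markdown parser: it assumes every table row starts and ends with `|` and it does not handle escaped pipes inside cells. The sample is an excerpt of the output above.

```python
def parse_markdown_tables(md: str) -> list:
    """Group consecutive pipe-delimited lines into tables of cell rows."""
    tables, current = [], []
    for line in md.splitlines():
        stripped = line.strip()
        if stripped.startswith("|") and stripped.endswith("|"):
            cells = [c.strip() for c in stripped.strip("|").split("|")]
            # Skip the |---|---| separator row under the header.
            if all(c and set(c) <= set("-: ") for c in cells):
                continue
            current.append(cells)
        elif current:
            # A non-table line ends the current table.
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables


# Excerpt of the invoice output shown above.
sample = (
    "**To:**\nTest Business\n\n"
    "|Invoice Number|INV-3337|\n|---|---|\n|Order Number|12345|\n"
    "|Invoice Date|January 25, 2016|\n|Due Date|January 31, 2016|\n"
    "|Total Due|$93.50|\n\n"
    "ANZ Bank\n"
)

tables = parse_markdown_tables(sample)
invoice = dict(tables[0])  # two-column table, so rows map to key/value pairs
print(invoice["Total Due"])  # $93.50
```

Calling `parse_markdown_tables(md_text_tables)` on the full string would return one entry per table, including the rows that the extraction garbled, so the values still need a sanity check.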