---
sidebar_label: PyMuPDF4LLM
---

# PyMuPDF4LLMLoader

This notebook provides a quick overview for getting started with the PyMuPDF4LLM [document loader](https://python.langchain.com/docs/concepts/#document-loaders). For detailed documentation of all PyMuPDF4LLMLoader features and configurations, head to the [GitHub repository](https://github.com/lakinduboteju/langchain-pymupdf4llm).

## Overview

### Integration details

| Class | Package | Local | Serializable | JS support |
| :--- | :--- | :---: | :---: |  :---: |
| [PyMuPDF4LLMLoader](https://github.com/lakinduboteju/langchain-pymupdf4llm) | [langchain_pymupdf4llm](https://pypi.org/project/langchain-pymupdf4llm) | ✅ | ❌ | ❌ |

### Loader features

| Source | Document Lazy Loading | Native Async Support | Extract Images | Extract Tables |
| :---: | :---: | :---: | :---: | :---: |
| PyMuPDF4LLMLoader | ✅ | ❌ | ✅ | ✅ |

## Setup

To access the PyMuPDF4LLM document loader, you'll need to install the `langchain-pymupdf4llm` integration package.

### Credentials

No credentials are required to use PyMuPDF4LLMLoader.

If you want automated, best-in-class tracing of your model calls, you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting the lines below:

In [None]:
# import getpass
# import os
#
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"

### Installation

Install **langchain_community** and **langchain-pymupdf4llm**.

In [1]:
%pip install -qU langchain_community langchain-pymupdf4llm

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.5/2.5 MB[0m [31m32.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m39.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m415.4/415.4 kB[0m [31m22.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m50.9/50.9 kB[0m [31m2.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m20.0/20.0 MB[0m [31m42.0 MB/s[0m eta [36m0:00:00[0m
[?25h

## Initialization

Now we can instantiate our loader object and load documents:

In [3]:
from langchain_pymupdf4llm import PyMuPDF4LLMLoader

file_path = "/content/paper_6114f3c3a8ea.pdf"
loader = PyMuPDF4LLMLoader(file_path)

## Load

In [4]:
docs = loader.load()
docs[0]

Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2023-12-29T02:07:31+00:00', 'source': '/content/paper_6114f3c3a8ea.pdf', 'file_path': '/content/paper_6114f3c3a8ea.pdf', 'total_pages': 12, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2023-12-29T02:07:31+00:00', 'trapped': '', 'modDate': 'D:20231229020731Z', 'creationDate': 'D:20231229020731Z', 'page': 0}, page_content='## Fast Inference of Mixture-of-Experts Language Models with Offloading\n\n\n**Artyom Eliseev**\nMoscow Institute of Physics and Technology\nYandex School of Data Analysis\n```\n   lavawolfiee@gmail.com\n\n```\n\n**Denis Mazur**\nMoscow Institute of Physics and Technology\nYandex\nResearchcore\n```\n   denismazur8@gmail.com\n\n```\n\n### Abstract\n\n\nWith the widespread adoption of Large Language Models (LLMs), many deep\nlearning practitioners are looking for strategies of running these models more\nefficiently. One such st

In [5]:
import pprint

pprint.pp(docs[0].metadata)

{'producer': 'pdfTeX-1.40.25',
 'creator': 'LaTeX with hyperref',
 'creationdate': '2023-12-29T02:07:31+00:00',
 'source': '/content/paper_6114f3c3a8ea.pdf',
 'file_path': '/content/paper_6114f3c3a8ea.pdf',
 'total_pages': 12,
 'format': 'PDF 1.5',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'moddate': '2023-12-29T02:07:31+00:00',
 'trapped': '',
 'modDate': 'D:20231229020731Z',
 'creationDate': 'D:20231229020731Z',
 'page': 0}


## Lazy Load

In [6]:
pages = []
for doc in loader.lazy_load():
    pages.append(doc)
    if len(pages) >= 10:
        # do some paged operation on the current batch, e.g.
        # index.upsert(pages)

        pages = []
len(pages)

2
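The loop above resets `pages` after every 10 documents, so the last 2 of the 12 pages are still waiting in `pages` when it ends. A minimal sketch of a batching helper that also yields the final partial batch, using only the standard library. `batched` is a hypothetical helper name, and `range(12)` stands in for `loader.lazy_load()`:

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items, including the final partial batch."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in for loader.lazy_load(): 12 items, like the 12 pages of the example PDF.
fake_pages = range(12)
batch_sizes = [len(batch) for batch in batched(fake_pages, 10)]
print(batch_sizes)  # [10, 2]
```

With a real loader, this would be used as `for batch in batched(loader.lazy_load(), 10): ...`, so every page is processed exactly once.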

In [7]:
from IPython.display import Markdown, display

part = pages[0].page_content[778:1189]
print(part)
# Markdown rendering
display(Markdown(part))

,
Wang, S., Maynez, J., Phuong, M., Tobin, T., Tacchetti, A., Trebacz, M., Robinson, K., Katariya,
Y., Riedel, S., Bailey, P., Xiao, K., Ghelani, N., Aroyo, L., Slone, A., Houlsby, N., Xiong, X., Yang,
Z., Gribovskaya, E., Adler, J., Wirth, M., Lee, L., Li, M., Kagohara, T., Pavagadhi, J., Bridgers, S.,
Bortsova, A., Ghemawat, S., Ahmed, Z., Liu, T., Powell, R., Bolina, V., Iinuma, M., Zablotskaia,
P., Besle



In [8]:
pprint.pp(pages[0].metadata)

{'producer': 'pdfTeX-1.40.25',
 'creator': 'LaTeX with hyperref',
 'creationdate': '2023-12-29T02:07:31+00:00',
 'source': '/content/paper_6114f3c3a8ea.pdf',
 'file_path': '/content/paper_6114f3c3a8ea.pdf',
 'total_pages': 12,
 'format': 'PDF 1.5',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'moddate': '2023-12-29T02:07:31+00:00',
 'trapped': '',
 'modDate': 'D:20231229020731Z',
 'creationDate': 'D:20231229020731Z',
 'page': 10}


The metadata attribute contains at least the following keys:
- source
- page (if in mode *page*)
- total_pages
- creationdate
- creator
- producer

Additional metadata fields are specific to each parser.
This information can be useful, for example to categorize your PDFs.
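As an illustration of using these keys, here is a small sketch that groups page numbers by `source`. It works on plain metadata dictionaries so it runs without any PDF; the function name `pages_by_source` and the sample entries are hypothetical:

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def pages_by_source(metadatas: Iterable[Dict[str, Any]]) -> Dict[str, List[int]]:
    """Map each `source` to the list of `page` numbers seen for it."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for metadata in metadatas:
        # `page` is only present in page mode, so skip entries without it.
        if "page" in metadata:
            grouped[metadata["source"]].append(metadata["page"])
    return dict(grouped)


# Hypothetical metadata, shaped like the dictionaries printed above.
sample = [
    {"source": "a.pdf", "page": 0, "total_pages": 2},
    {"source": "a.pdf", "page": 1, "total_pages": 2},
    {"source": "b.pdf", "total_pages": 5},  # single mode: no `page` key
]
index = pages_by_source(sample)
print(index)  # {'a.pdf': [0, 1]}
```

With loaded documents, the call would be `pages_by_source(doc.metadata for doc in docs)`.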

## Splitting mode & custom pages delimiter

When loading the PDF file you can split it in two different ways:
- By page
- As a single text flow

By default PyMuPDF4LLMLoader will split the PDF by page.

### Extract the PDF by page. Each page is extracted as a langchain Document object:

In [10]:
loader = PyMuPDF4LLMLoader(
    "/content/paper_6114f3c3a8ea.pdf",
    mode="page",
)
docs = loader.load()

print(len(docs))
pprint.pp(docs[0].metadata)

12
{'producer': 'pdfTeX-1.40.25',
 'creator': 'LaTeX with hyperref',
 'creationdate': '2023-12-29T02:07:31+00:00',
 'source': '/content/paper_6114f3c3a8ea.pdf',
 'file_path': '/content/paper_6114f3c3a8ea.pdf',
 'total_pages': 12,
 'format': 'PDF 1.5',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'moddate': '2023-12-29T02:07:31+00:00',
 'trapped': '',
 'modDate': 'D:20231229020731Z',
 'creationDate': 'D:20231229020731Z',
 'page': 0}


In this mode the PDF is split by page, and each resulting Document's metadata contains the `page` key (the page number). In some cases, however, you may want to process the PDF as a single text flow so that paragraphs are not cut in half at page boundaries. For that, use the *single* mode:

### Extract the whole PDF as a single langchain Document object:

In [12]:
loader = PyMuPDF4LLMLoader(
    "/content/paper_6114f3c3a8ea.pdf",
    mode="single",
)
docs = loader.load()

print(len(docs))
pprint.pp(docs[0].metadata)

1
{'producer': 'pdfTeX-1.40.25',
 'creator': 'LaTeX with hyperref',
 'creationdate': '2023-12-29T02:07:31+00:00',
 'source': '/content/paper_6114f3c3a8ea.pdf',
 'file_path': '/content/paper_6114f3c3a8ea.pdf',
 'total_pages': 12,
 'format': 'PDF 1.5',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'moddate': '2023-12-29T02:07:31+00:00',
 'trapped': '',
 'modDate': 'D:20231229020731Z',
 'creationDate': 'D:20231229020731Z'}


Since the whole PDF is now one Document, the `page` (page number) metadata is no longer present in this mode. Here's how to clearly identify where pages end in the text flow:

### Add a custom *pages_delimiter* to identify where pages end in *single* mode:

In [13]:
loader = PyMuPDF4LLMLoader(
    "/content/paper_6114f3c3a8ea.pdf",
    mode="single",
    pages_delimiter="\n-------THIS IS A CUSTOM END OF PAGE-------\n\n",
)
docs = loader.load()

part = docs[0].page_content[10663:11317]
print(part)
display(Markdown(part))

the experts that constitute vast majority of model parameters do not fit even
with quantization. Finally, even if we could fit the model parameters in memory, running generative
inference requires additional memory for layer activations and past attention keys & values.

### 3 Method

In this work, we aim to systematically find the optimal way to inference modern Mixture-of-Experts
LLMs on desktop or low-end cloud instances. More specifically, we focus on the task of generating
tokens interactively, i.e. generate multiple tokens per second at batch size 1[5].

The generative inference workload consists of two phases: 1) encoding the input prompt 



The default `pages_delimiter` is `\n-----\n\n`.
You could instead use `\n`, or `\f` to clearly indicate a page change, or `<!-- PAGE BREAK -->` for seamless injection in a Markdown viewer without a visual effect.
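A distinctive delimiter also makes it possible to recover page boundaries from single-mode text later. The sketch below assumes the loader simply places the delimiter between consecutive pages (whether one also follows the last page is not confirmed here); `split_pages` is a hypothetical helper and the sample text is made up:

```python
from typing import List

# Same delimiter as in the example above.
PAGES_DELIMITER = "\n-------THIS IS A CUSTOM END OF PAGE-------\n\n"


def split_pages(text: str, delimiter: str = PAGES_DELIMITER) -> List[str]:
    """Split single-mode text back into per-page chunks."""
    if not delimiter:
        raise ValueError("delimiter must be a non-empty string")
    return text.split(delimiter)


# Made-up stand-in for docs[0].page_content in single mode.
sample_text = PAGES_DELIMITER.join(["first page", "second page", "third page"])
recovered = split_pages(sample_text)
print(len(recovered))  # 3
```

This only works reliably if the delimiter cannot occur in the document text itself, which is one reason to prefer an unusual string over a bare `\n`.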

## Extract images from the PDF

You can extract the images in your PDFs as text, using one of three approaches:
- RapidOCR (lightweight Optical Character Recognition tool)
- Tesseract (OCR tool with high precision)
- Multimodal language model

The result is inserted at the end of the page's text.

### Extract images from the PDF with rapidOCR:

In [14]:
%pip install -qU rapidocr-onnxruntime pillow

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m14.9/14.9 MB[0m [31m91.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m16.0/16.0 MB[0m [31m84.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m969.6/969.6 kB[0m [31m50.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m46.0/46.0 kB[0m [31m3.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m86.8/86.8 kB[0m [31m6.7 MB/s[0m eta [36m0:00:00[0m
[?25h

In [15]:
from langchain_community.document_loaders.parsers import RapidOCRBlobParser

loader = PyMuPDF4LLMLoader(
    "/content/paper_6114f3c3a8ea.pdf",
    mode="page",
    extract_images=True,
    images_parser=RapidOCRBlobParser(),
)
docs = loader.load()

part = docs[5].page_content[1863:]
print(part)
display(Markdown(part))

4.2** **Mixed MoE Quantization**


Next, we test how different Quantization schemes affect MoE performance and size. We also use
Mixtral-8x7B, but this time, we use non-instruction-tuned variant since it fits better with the available
benchmarks. We measure WikiText2 perpliexity Merity et al. (2016), C4 perplexity Raffel et al.
(2020), as well as 5-shot MMLU accuracy Hendrycks et al. (2021). Our objective for this section is
to find the best trade off between size and performance for offloading with the target setups. Note
that out of 46.7B total parameters in the Mixtral-8x7B model, the experts constitute 45.1B (96.6%).
The rest of the model parameters are allocated to embeddings, self-attention layers, MoE gates and
minor layers such as LayerNorm.


**Attn** **Experts** **Model**
**C4** **MMLU**
**quant** **quant** **size, GB [Wiki2]**


**Attn** **Experts** **Model**
**C4** **MMLU**
**quant** **quant** **size, GB [Wiki2]**


FP16

4-bit


FP16 86.99 3.59 6.52 70.51%
4-bit 25.82 3.67 6.58 70.3%
3-bit 23.21 3.96 6.78 69.32%
2-bit 19.33 4.52 7.31 66.66%

FP16 85.16 3.68 6.59 —
4-bit 23.99 3.76 6.66 69.11%
3-bit 21.37 4.05 6.87 68.47%
2-bit 17.54 4.61 7.42 65.58%


3-bit

2-bit


FP16 85.08 3.99 6.90 —
4-bit 23.92 4.06 6.97 66.54%
3-bit 21.31 4.34 7.21 65.79%
2-bit 17.46 4.90 7.82 61.83%

FP16 84.96 4.98 7.92 —
4-bit 23.79 5.08 8.06 59.0%
3-bit 21.18 5.36 8.34 57.67%
2-bit 17.30 5.97 9.11 55.26%


Table 1: Perplexity and model size evaluation of Mixtral-8x7B with different quantization for shared
attention (Attn quant) and experts (Experts quant) layers. For comprarison, a Mistral-7B 4-bit
quantized model has Wiki2 perplexity 5.03, C4 perplexity 7.56 and MMLU score 61.3%. See Section
4.2 for details. Green values correspond to the configurations we chose for full system evaluation.



Note that RapidOCR is designed to work with Chinese and English; it is not intended for other languages.

### Extract images from the PDF with Tesseract:

In [17]:
%pip install -qU pytesseract

In [18]:
from langchain_community.document_loaders.parsers import TesseractBlobParser

loader = PyMuPDF4LLMLoader(
    "/content/paper_6114f3c3a8ea.pdf",
    mode="page",
    extract_images=True,
    images_parser=TesseractBlobParser(),
)
docs = loader.load()

print(docs[5].page_content[1863:])

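TesseractNotFoundError: tesseract is not installed or it's not in your PATH. See README file for more information.

This error means the Tesseract OCR engine itself is missing: `pytesseract` is only a Python wrapper, so the `tesseract` executable must also be installed and available on your PATH.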
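To catch a missing Tesseract installation before running the loader, a quick check with the standard library can help. This is a minimal sketch; `tesseract_available` is a hypothetical helper, and it assumes the executable is named `tesseract`:

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(binary) is not None


if not tesseract_available():
    print("Tesseract executable not found on PATH; install it before using TesseractBlobParser.")
```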

### Extract images from the PDF with a multimodal model:

In [None]:
%pip install -qU langchain_openai

Note: you may need to restart the kernel to use updated packages.


In [None]:
import os

from dotenv import load_dotenv

load_dotenv()

True

In [None]:
from getpass import getpass

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key =")

In [None]:
from langchain_community.document_loaders.parsers import LLMImageBlobParser
from langchain_openai import ChatOpenAI

loader = PyMuPDF4LLMLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="page",
    extract_images=True,
    images_parser=LLMImageBlobParser(
        model=ChatOpenAI(model="gpt-4o-mini", max_tokens=1024)
    ),
)
docs = loader.load()

print(docs[5].page_content[1863:])

## Extract tables from the PDF

With PyMuPDF4LLM you can extract tables from your PDFs in *markdown* format:

In [None]:
loader = PyMuPDF4LLMLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="page",
    # "lines_strict" is the default strategy and
    # is the most accurate for tables with column and row lines,
    # but may not work well with all documents.
    # "lines" is a less strict strategy that may work better with
    # some documents.
    # "text" is the least strict strategy and may work better
    # with documents that do not have tables with lines.
    table_strategy="lines",
)
docs = loader.load()

part = docs[4].page_content[3210:]
print(part)
display(Markdown(part))
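Once tables arrive as Markdown, they can be turned back into rows for further processing. The sketch below assumes pipe-delimited Markdown tables with a `|---|` separator row; `parse_markdown_table` is a hypothetical helper and the sample table is made up:

```python
from typing import List


def parse_markdown_table(markdown: str) -> List[List[str]]:
    """Return the cell rows of pipe-delimited table lines, skipping separator rows."""
    rows: List[List[str]] = []
    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        # Only consider lines of the form "|...|".
        if len(line) < 2 or not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        # Skip separator rows such as |---|:---:|
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    return rows


# Made-up sample, shaped like a small Markdown table.
sample_table = "|Name|Value|\n|---|---|\n|a|1|\n|b|2|"
table_rows = parse_markdown_table(sample_table)
print(table_rows)  # [['Name', 'Value'], ['a', '1'], ['b', '2']]
```

The helper ignores any non-table text around the table, so it can be applied directly to a slice of `page_content`. It does not handle escaped pipes inside cells.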

## Working with Files

Many document loaders involve parsing files. The difference between such loaders usually stems from how the file is parsed, rather than how the file is loaded. For example, you can use `open` to read the binary content of either a PDF or a markdown file, but you need different parsing logic to convert that binary data into text.

As a result, it can be helpful to decouple the parsing logic from the loading logic, which makes it easier to re-use a given parser regardless of how the data was loaded.
You can use this strategy to analyze different files, with the same parsing parameters.

In [None]:
from langchain_community.document_loaders import FileSystemBlobLoader
from langchain_community.document_loaders.generic import GenericLoader
from langchain_pymupdf4llm import PyMuPDF4LLMParser

loader = GenericLoader(
    blob_loader=FileSystemBlobLoader(
        path="./example_data/",
        glob="*.pdf",
    ),
    blob_parser=PyMuPDF4LLMParser(),
)
docs = loader.load()

part = docs[0].page_content[:562]
print(part)
display(Markdown(part))

## API reference

For detailed documentation of all PyMuPDF4LLMLoader features and configurations head to the GitHub repository: https://github.com/lakinduboteju/langchain-pymupdf4llm