`pdf2md.py`

Convert a PDF to Markdown using OpenAI's multimodal text/vision model (`gpt-4o`).

Setup

OpenAI API Access

Acquire an OpenAI API key from your OpenAI account.

Add your OpenAI API key to your environment (if it's not defined already):

$ export OPENAI_API_KEY="sk-..." # your key will look like "sk-..."
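
To confirm the key is visible to Python before running a conversion, a small check along these lines may help. This is only a sketch: `get_api_key` is a hypothetical helper name, not a function taken from `pdf2md.py`.

```python
import os


def get_api_key(env=os.environ):
    """Return the OpenAI API key from the environment, or raise if it is missing."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it before running pdf2md.py")
    return key


if __name__ == "__main__":
    try:
        get_api_key()
        print("OPENAI_API_KEY is set")
    except RuntimeError as err:
        print(err)
```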

`poppler`

Install poppler for your system; it handles the PDF-to-image conversion.

E.g., on macOS via the Homebrew package manager:

$ brew install poppler
...

On Ubuntu/Debian systems:

$ sudo apt-get install poppler-utils
...

For other Linux distributions, use their respective package managers:

Fedora/RHEL: sudo dnf install poppler-utils
Arch Linux: sudo pacman -S poppler
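
After installing, you can check that poppler's command-line tools are on your `PATH`. The sketch below assumes the conversion relies on the `pdftoppm` and `pdfinfo` utilities that ship with poppler; `find_poppler_tools` is a hypothetical helper, not part of `pdf2md.py`.

```python
import shutil


def find_poppler_tools(tools=("pdftoppm", "pdfinfo"), which=shutil.which):
    """Map each poppler command-line tool to its path, or None if it is not on PATH."""
    return {tool: which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in find_poppler_tools().items():
        print(f"{tool}: {path or 'NOT FOUND (install poppler)'}")
```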

Python Dependencies

Then create a virtual environment and install the Python dependencies with uv (run pip install uv first if you don't have it in your global Python environment):

$ uv venv
...
Activate with: source .venv/bin/activate
$ source .venv/bin/activate
(pdf2md) $ uv sync
...

Usage

(pdf2md) $ ./pdf2md.py -h
usage: pdf2md.py [-h] [-c] [--first-page FIRST_PAGE] [--last-page LAST_PAGE] [--dpi DPI] [-n CONCURRENCY] pdf [output]

Script for converting PDF to Markdown via OpenAI's `gpt-4o` model.

positional arguments:
  pdf                   path to input PDF file to convert to Markdown
  output                output file where Markdown conversion will be written (or stdout) (default: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)

options:
  -h, --help            show this help message and exit
  -c, --cache-pages     cache and reuse intermediate PDF page images (default: False)
  --first-page FIRST_PAGE
                        the first page to convert (default: None)
  --last-page LAST_PAGE
                        the last page to convert (default: None)
  --dpi DPI             intermediate image resolution in dots-per-inch (DPI) (higher DPI is higher quality, but takes more memory/disk space) (default: 200)
  -n, --concurrency CONCURRENCY
                        number of pages to process in parallel (default: 8)
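
For readers who want to see how a command line like this could be declared, here is a minimal `argparse` sketch reconstructed from the help text above. It is an assumption about the structure, not the script's actual source; the `build_parser` name is hypothetical, and only the option names, help strings, and defaults are taken from the help output.

```python
import argparse
import sys


def build_parser():
    """Build a parser whose options mirror the help text shown above."""
    parser = argparse.ArgumentParser(
        prog="pdf2md.py",
        description="Script for converting PDF to Markdown via OpenAI's `gpt-4o` model.",
        # Appends "(default: ...)" to each help string, as seen in the help output.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("pdf", help="path to input PDF file to convert to Markdown")
    parser.add_argument(
        "output",
        nargs="?",  # optional positional: falls back to stdout when omitted
        type=argparse.FileType("w"),
        default=sys.stdout,
        help="output file where Markdown conversion will be written (or stdout)",
    )
    parser.add_argument("-c", "--cache-pages", action="store_true",
                        help="cache and reuse intermediate PDF page images")
    parser.add_argument("--first-page", type=int, help="the first page to convert")
    parser.add_argument("--last-page", type=int, help="the last page to convert")
    parser.add_argument("--dpi", type=int, default=200,
                        help="intermediate image resolution in dots-per-inch (DPI)")
    parser.add_argument("-n", "--concurrency", type=int, default=8,
                        help="number of pages to process in parallel")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args(["example.pdf", "--dpi", "150"]))
```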

Example

You can download an example PDF and convert it as follows:

(pdf2md) $ curl -so text-and-table.pdf 'https://api.slingacademy.com/v1/sample-data/files/text-and-table.pdf'
(pdf2md) $ ./pdf2md.py text-and-table.pdf > text-and-table.pdf.md
2page [00:08,  4.21s/page]
(pdf2md) $ cat text-and-table.pdf.md 
# Sample PDF File for Practice

*Page 1: Table Data*

The table below displays information about some fictional people:

| **First Name** | **Last Name** | **Age** |
|----------------|--------------|---------|
| John           | Doe          | 99      |
| Jane           | Doo          | 29      |
| Black          | Smith        | 49      |
| Lone           | Wolf         | 35      |
| Foo            | Bar          | 5       |
| Sekiro         | Honda        | 45      |
| Elon           | Musk         | 54      |
| Catherine      | Roth         | 55      |
| Julio          | Caesar       | 58      |
| Candy          | Sweet        | 6       |
| Bo             | Kim          | 32      |
| Sling          | Academy      | 44      |
| Rantaro        | Shinsuke     | 9       |
| Cold           | Water        | 15      |
| Fried          | Chicken      | 3       |
| Blonde         | Pink         | 23      |

*This PDF file has one more page. Don’t miss it.*


----------


*Page 2*

Welcome to [www.slingacademy.com](http://www.slingacademy.com). You can find more sample data at  
[https://www.slingacademy.com/cat/sample-data/](https://www.slingacademy.com/cat/sample-data/)

Happy coding & have a nice day!


----------

Output Markdown file: text-and-table.pdf.md
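
The output above shows the pages in order, separated by a rule. As a rough illustration of how pages might be converted in parallel while keeping that order, consider the sketch below. The `convert_page` function is a hypothetical stand-in for the per-page model call, and the separator string is an assumption based on the example output, not the script's actual code.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed separator, modeled on the rule between pages in the example output.
PAGE_SEPARATOR = "\n\n----------\n\n"


def convert_page(page_number):
    """Hypothetical stand-in for the per-page model call; returns Markdown for one page."""
    return f"*Page {page_number}*"


def convert_pages(page_numbers, concurrency=8, convert=convert_page):
    """Convert pages in parallel and join the results in their original page order."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # Executor.map yields results in input order, even if later pages finish first.
        results = list(pool.map(convert, page_numbers))
    return PAGE_SEPARATOR.join(results)


if __name__ == "__main__":
    print(convert_pages([1, 2], concurrency=2))
```

Threads suit this kind of work because each page conversion spends most of its time waiting on a network response rather than using the CPU.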

If you want to save intermediate PNG images of the PDF pages, use the -c/--cache-pages option. This can speed things up a little if you rerun a conversion of a large PDF, though OpenAI API rate limits may still be the limiting factor.

(pdf2md) $ ./pdf2md.py text-and-table.pdf --cache-pages > text-and-table.pdf.md         
2page [00:09,  4.78s/page]
(pdf2md) $ ls text-and-table
text-and-table.pdf0001-1.png text-and-table.pdf0002-2.png
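
The idea behind page caching can be sketched as "reuse the PNGs if they exist, otherwise render them". The helper names below (`cached_page_images`, `load_or_render_pages`) and the `render` callback are hypothetical and only illustrate the pattern; the script's own caching logic may differ.

```python
import tempfile
from pathlib import Path


def cached_page_images(cache_dir):
    """Return cached PNG page images sorted by file name, or an empty list if none exist."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []
    # Zero-padded page numbers in the file names make name order match page order.
    return sorted(cache_dir.glob("*.png"))


def load_or_render_pages(cache_dir, render):
    """Reuse cached images when present; otherwise call `render` to create them."""
    cached = cached_page_images(cache_dir)
    if cached:
        return cached
    Path(cache_dir).mkdir(parents=True, exist_ok=True)
    return render(Path(cache_dir))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        def fake_render(directory):
            # Stand-in renderer: writes one empty PNG so the second call hits the cache.
            (directory / "page0001-1.png").write_bytes(b"")
            return cached_page_images(directory)

        cache = Path(tmp) / "pages"
        print(load_or_render_pages(cache, fake_render))  # rendered
        print(load_or_render_pages(cache, fake_render))  # reused from cache
```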

Known Issues

PDFs that contain images cannot be converted with full fidelity because images are not extracted. Headings, lists, paragraphs, code blocks, and other basic markup that is compatible with Markdown should work, though.
Each page of the input PDF is processed independently of the others, so some context (e.g., heading levels) can be lost from page to page in multi-page PDFs. As a result, the markup of adjacent pages may be slightly inconsistent compared to the original document.
OpenAI may enforce rate limits on your account, both for requests per unit time and for completion tokens generated per unit time. If you process large PDFs with many pages, or even many short PDFs, the script will wait and retry for you, but waiting for your rate limits to reset may be slow.
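
The wait-and-retry behavior mentioned in the last item is commonly implemented with exponential backoff. The following is a generic sketch of that pattern, not the script's actual retry code: `RateLimited` is a hypothetical stand-in for whatever rate-limit error the API client raises, and the delay parameters are illustrative.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for the rate-limit error raised by an API client."""


def call_with_backoff(request, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `request`, retrying with exponential backoff and jitter when rate limited."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Wait 1x, 2x, 4x, ... the base delay, plus up to 10% random jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
```

The jitter spreads out retries from pages that were rate limited at the same moment, so they do not all hit the API again at once.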
