Readme

About

This is a command-line tool designed to handle PDF files, with the goal of seamlessly integrating PDFs into various note-taking workflows. Currently, the following capabilities are offered:

Managing the TOC: Export the table of contents from a PDF to a user-friendly, plain text list and import it back into the PDF after any modifications. This TOC file is easy to read and convenient for users to modify. The format can be viewed here.
Annotations Management:
- Export formatted text annotations: Extract annotations like highlights, text, squares, and other types of annots from a PDF, capture relevant document images, and support OCR extraction from these images. Using Mako templates, you can import the formatted PDF annotations into your preferred note-taking system.
- Manage XFDF Annotations: Export XFDF annotations from the PDF and import them back into the PDF. XFDF files can be imported by PDF readers like XChange. Since some OCR software may flatten annotations during the OCR process, you can export to XFDF before OCR and then import the XFDF after OCR to retain the full annotation functionality.
- Delete annotations from the PDF: Easily share original PDFs with others.
Page Label and Number Conversion: Convert page labels to page numbers and vice versa. Sometimes, while the data is stored as a page number, readers might require navigation based on page labels. This feature addresses that discrepancy.

Start

git clone --depth 1 https://github.com/yuchen-lea/pdfhelper.git
cd pdfhelper
make install
pip install -r requirements.txt

Then you can use the pdfhelper command-line tool!

For subsequent updates, simply pull the latest code from git and run make update.

Usage

pdfhelper -h

usage: pdfhelper [-h] [--version]
                 {export-toc,import-toc,delete-annot,export-xfdf-annot,import-xfdf-annot,export-annot,page-label-to-number,page-number-to-label}
                 ... INFILE

Some useful functions to process a PDF file.

positional arguments:
  {export-toc,import-toc,delete-annot,export-xfdf-annot,import-xfdf-annot,export-annot,page-label-to-number,page-number-to-label}
    export-toc          Export the TOC of the PDF.
    import-toc          Import TOC from a file into the PDF.
    delete-annot        Delete annotations from the PDF.
    export-xfdf-annot   Export XFDF annotations of the PDF.
    import-xfdf-annot   Import XFDF annotations of the PDF.
    export-annot        Export formatted text annotations of the PDF.
    page-label-to-number
                        Convert page label to page number.
    page-number-to-label
                        Convert page number to page label.
  INFILE                PDF file to process

options:
  -h, --help            show this help message and exit
  --version, -v         show program's version number and exit

TOC format

Sample toc file:

- Cover#1
- The Ten Commandments#2
- The Five Rules#3
- Contents#8
- Foreword#10
- Preface#12
# 1 = 16
- 1. Toys#2
- 2. Do It, Do It Again, and Again, and Again ...#14
- 3. Cons the Magnificent#32
# +2
- 4. Numbers Games#58

Here, you see three ways of customization:

Defining a Page Number: To bookmark page 3 with the title “The Five Rules”:
```
- The Five Rules#3
    
```
- 🙋‍ List indentation is the same as toc indentation.
Setting the First Page: To bookmark page 17 with the title “1. Toys” (considering the first page is numbered 16):
```
# 1 = 16
- 1. Toys#2
    
```
- 🙋‍Note: Using the pattern “# number1=number2” will treat the physical page number2 of the PDF as number1. Any subsequent page numbers set as ‘x’ will actually point to the physical page calculated as x+number2-number1. This is suitable for setting the first page number, for example, “# 1=19”, as well as setting the starting page number for the second volume, like “# 250=5”.
Accounting for Page Gaps: To bookmark the title “4. Numbers Games” on page 75 (calculated as 58 + (16-1) + 2):
```
# +2
- 4. Numbers Games#58
    
```
- useful when there are missing or extra pages. At the location of missing pages (for instance, where blank pages counted in the pagination have been removed), set “# -[number of missing pages]”. At the location where pages are added (like illustration pages not counted in the pagination), set “# +[number of added pages]”.

Export Annotations

Currently, the following annotation types are supported:

Type	Result
Text	comment
FreeText	comment
Square	comment + picture (set the zoom factor by `--image-zoom`) + text (extract from the PDF, or use the `--ocr-service` and `--ocr-language` to recognize text within images.)
Highlight	comment + text (extract from the PDF)
Underline	comment + text (extract from the PDF)
Squiggly	comment + text (extract from the PDF)
StrikeOut	comment + text (extract from the PDF)
Ink	comment + picture (captures the content within the marked height of the document, rather than just the mark itself. set the zoom factor by `--image-zoom`) + text (extract from the PDF, or use the `--ocr-service` and `--ocr-language` to recognize text within images.)
Line	comment + picture (captures the content within the marked height of the document, rather than just the mark itself. set the zoom factor by `--image-zoom`) + text (extract from the PDF, or use the `--ocr-service` and `--ocr-language` to recognize text within images.)

You can customize the note format by:

--with-toc
--toc-list-item-format
--annot-list-item-format

Changelog

2.4.0
- new feature export-info: Export PDF info to xml file
2.3.0
- export-annot supports Mako templates
2.2.0
- new feature import-xfdf-annot
2.1.0
- new feature export-xfdf-annot
2.0.0
- ⭐ Update argument parsing structure to use subparsers for clearer command distinction.
- add Makefile to install and uninstall script
1.4.0
- new feature delete-annot: Delete all annots in pdf
1.3.0
- improve feature import-toc: Support set the first page and fix a gap. See more info here
1.2.0
- new feature export-annot: Export the annotations of PDF
1.1.0
- new feature export-toc: Export the toc of pdf to human-readable file. You can see the format here
- new feature import-toc: Import the toc of pdf, the toc shares the same format with the exported one

Credits

This project is inspired by the following tool:

0xabu/pdfannots: Extracts and formats text annotations from a PDF file: based on pdfminer and format as markdown text. It deals with hyphens but donot extract rectangle annot.
PDFPatcher(Chinese) a great pdf utility tool.

Name		Name	Last commit message	Last commit date
Latest commit History 67 Commits
.gitignore		.gitignore
Makefile		Makefile
README.org		README.org
README_CN.org		README_CN.org
format_annots_template.py.example		format_annots_template.py.example
ocr_config.ini.example		ocr_config.ini.example
pdf_handler.py		pdf_handler.py
pdfhelper.py		pdfhelper.py
picture_handler.py		picture_handler.py
requirements.txt		requirements.txt
toc_handler.py		toc_handler.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Readme

About

Start

Usage

TOC format

Export Annotations

Changelog

Credits

About

Releases

Packages

Languages

yuchen-lea/pdfhelper

Folders and files

Latest commit

History

Repository files navigation

Readme

About

Start

Usage

TOC format

Export Annotations

Changelog

Credits

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages