Skip to content

yuchen-lea/pdfhelper

Repository files navigation

Readme

中文文档

About

This is a command-line tool designed to handle PDF files, with the goal of seamlessly integrating PDFs into various note-taking workflows. Currently, the following capabilities are offered:

  1. Managing the TOC: Export the table of contents from a PDF to a user-friendly, plain text list and import it back into the PDF after any modifications. This TOC file is easy to read and convenient for users to modify. The format can be viewed here.
  2. Annotations Management:
    • Export formatted text annotations: Extract annotations like highlights, text, squares, and other types of annots from a PDF, capture relevant document images, and support OCR extraction from these images. Using Mako templates, you can import the formatted PDF annotations into your preferred note-taking system.
    • Manage XFDF Annotations: Export XFDF annotations from the PDF and import them back into the PDF. XFDF files can be imported by PDF readers like XChange. Since some OCR software may flatten annotations during the OCR process, you can export to XFDF before OCR and then import the XFDF after OCR to retain the full annotation functionality.
    • Delete annotations from the PDF: Easily share original PDFs with others.
  3. Page Label and Number Conversion: Convert page labels to page numbers and vice versa. Sometimes, while the data is stored as a page number, readers might require navigation based on page labels. This feature addresses that discrepancy.

Start

git clone --depth 1 https://github.com/yuchen-lea/pdfhelper.git
cd pdfhelper
make install
pip install -r requirements.txt

Then you can use the pdfhelper command-line tool!

For subsequent updates, simply pull the latest code from git and run make update.

Usage

pdfhelper -h
usage: pdfhelper [-h] [--version]
                 {export-toc,import-toc,delete-annot,export-xfdf-annot,import-xfdf-annot,export-annot,page-label-to-number,page-number-to-label}
                 ... INFILE

Some useful functions to process a PDF file.

positional arguments:
  {export-toc,import-toc,delete-annot,export-xfdf-annot,import-xfdf-annot,export-annot,page-label-to-number,page-number-to-label}
    export-toc          Export the TOC of the PDF.
    import-toc          Import TOC from a file into the PDF.
    delete-annot        Delete annotations from the PDF.
    export-xfdf-annot   Export XFDF annotations of the PDF.
    import-xfdf-annot   Import XFDF annotations of the PDF.
    export-annot        Export formatted text annotations of the PDF.
    page-label-to-number
                        Convert page label to page number.
    page-number-to-label
                        Convert page number to page label.
  INFILE                PDF file to process

options:
  -h, --help            show this help message and exit
  --version, -v         show program's version number and exit

TOC format

Sample toc file:

- Cover#1
- The Ten Commandments#2
- The Five Rules#3
- Contents#8
- Foreword#10
- Preface#12
# 1 = 16
- 1. Toys#2
- 2. Do It, Do It Again, and Again, and Again ...#14
- 3. Cons the Magnificent#32
# +2
- 4. Numbers Games#58

Here, you see three ways of customization:

  1. Defining a Page Number: To bookmark page 3 with the title “The Five Rules”:
    - The Five Rules#3
        
    • 🙋‍ List indentation is the same as toc indentation.
  2. Setting the First Page: To bookmark page 17 with the title “1. Toys” (considering the first page is numbered 16):
    # 1 = 16
    - 1. Toys#2
        
    • 🙋‍Note: Using the pattern “# number1=number2” will treat the physical page number2 of the PDF as number1. Any subsequent page numbers set as ‘x’ will actually point to the physical page calculated as x+number2-number1. This is suitable for setting the first page number, for example, “# 1=19”, as well as setting the starting page number for the second volume, like “# 250=5”.
  3. Accounting for Page Gaps: To bookmark the title “4. Numbers Games” on page 75 (calculated as 58 + (16-1) + 2):
    # +2
    - 4. Numbers Games#58
        
    • useful when there are missing or extra pages. At the location of missing pages (for instance, where blank pages counted in the pagination have been removed), set “# -[number of missing pages]”. At the location where pages are added (like illustration pages not counted in the pagination), set “# +[number of added pages]”.

Export Annotations

Currently, the following annotation types are supported:

TypeResult
Textcomment
FreeTextcomment
Squarecomment + picture (set the zoom factor by --image-zoom) + text (extract from the PDF, or use the --ocr-service and --ocr-language to recognize text within images.)
Highlightcomment + text (extract from the PDF)
Underlinecomment + text (extract from the PDF)
Squigglycomment + text (extract from the PDF)
StrikeOutcomment + text (extract from the PDF)
Inkcomment + picture (captures the content within the marked height of the document, rather than just the mark itself. set the zoom factor by --image-zoom) + text (extract from the PDF, or use the --ocr-service and --ocr-language to recognize text within images.)
Linecomment + picture (captures the content within the marked height of the document, rather than just the mark itself. set the zoom factor by --image-zoom) + text (extract from the PDF, or use the --ocr-service and --ocr-language to recognize text within images.)

You can customize the note format by:

  • --with-toc
  • --toc-list-item-format
  • --annot-list-item-format

Changelog

  • 2.4.0
    • new feature export-info: Export PDF info to xml file
  • 2.3.0
    • export-annot supports Mako templates
  • 2.2.0
    • new feature import-xfdf-annot
  • 2.1.0
    • new feature export-xfdf-annot
  • 2.0.0
    • ⭐ Update argument parsing structure to use subparsers for clearer command distinction.
    • add Makefile to install and uninstall script
  • 1.4.0
    • new feature delete-annot: Delete all annots in pdf
  • 1.3.0
    • improve feature import-toc: Support set the first page and fix a gap. See more info here
  • 1.2.0
    • new feature export-annot: Export the annotations of PDF
  • 1.1.0
    • new feature export-toc: Export the toc of pdf to human-readable file. You can see the format here
    • new feature import-toc: Import the toc of pdf, the toc shares the same format with the exported one

Credits

This project is inspired by the following tool:

About

Some useful functions to process pdf file

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published