privacyforms-pdf

Python library for extracting and filling PDF forms using pypdf.

Features

  • Extract form data from PDF files in pure Python via pypdf (no external binaries required)
  • Fill PDF forms programmatically using pypdf or pdfcpu
  • Extract field geometry (position and size) information
  • Command-line interface with multiple commands
  • Full type hints and comprehensive test coverage (99%)
  • Support for all form field types (text, date, checkbox, radio button groups, etc.)

Requirements

  • Python 3.14+
  • pypdf >= 5.0
  • pdfcpu >= 0.9 (optional, for --pdfcpu fill-form option)

Optional: Installing pdfcpu

pdfcpu is required only if you use the --pdfcpu option of fill-form. It can fill some PDFs that pypdf has trouble with.

# macOS
brew install pdfcpu

# Or download from https://github.com/pdfcpu/pdfcpu/releases
# Make sure pdfcpu is in your PATH
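
Before passing --pdfcpu, it can help to confirm that the binary is reachable. The sketch below uses only the standard library's shutil.which. Note that find_pdfcpu is a hypothetical helper name, not part of privacyforms-pdf; the library ships its own is_pdfcpu_available(), whose implementation may differ.

```python
import shutil
from typing import Optional


def find_pdfcpu(binary: str = "pdfcpu") -> Optional[str]:
    """Return the resolved path of the pdfcpu binary, or None if it is not on PATH.

    Hypothetical helper for illustration; not part of privacyforms-pdf.
    """
    return shutil.which(binary)


path = find_pdfcpu()
print(f"pdfcpu found at {path}" if path else "pdfcpu not found on PATH")
```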

Installation

# Clone the repository
git clone <repo-url>
cd privacyforms-pdf

# Install with uv
uv sync

Quick Start

Check that the CLI is ready

pdf-forms check

Command Line Usage

# Check if a PDF contains a form
pdf-forms info form.pdf

# List all form fields
pdf-forms list-fields form.pdf

# Get a specific field value
pdf-forms get-value form.pdf "Field Name"

# Extract form data to JSON
pdf-forms extract form.pdf -o output.json

# Extract form data to stdout
pdf-forms extract form.pdf

# Fill a form from JSON (validates before filling)
pdf-forms fill-form form.pdf data.json -o filled.pdf

# Fill a form without validation
pdf-forms fill-form form.pdf data.json -o filled.pdf --no-validate

# Fill a form in-place (modifies original)
pdf-forms fill-form form.pdf data.json

# Fill with strict mode (requires all form fields)
pdf-forms fill-form form.pdf data.json -o filled.pdf --strict

# Fill using pdfcpu instead of pypdf (requires pdfcpu to be installed)
pdf-forms fill-form form.pdf data.json -o filled.pdf --pdfcpu

# Fill using a custom pdfcpu binary path
pdf-forms fill-form form.pdf data.json -o filled.pdf --pdfcpu --pdfcpu-path /usr/local/bin/pdfcpu
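
When driving the CLI from a script, the fill-form flags shown above can be assembled into an argument list. This is a minimal sketch: build_fill_command is a hypothetical helper, it uses only the flags listed in this README, and it does not execute anything itself.

```python
from typing import Optional


def build_fill_command(
    pdf: str,
    data: str,
    output: Optional[str] = None,
    *,
    validate: bool = True,
    strict: bool = False,
    use_pdfcpu: bool = False,
    pdfcpu_path: Optional[str] = None,
) -> list:
    """Assemble a `pdf-forms fill-form` argument list (hypothetical helper)."""
    cmd = ["pdf-forms", "fill-form", pdf, data]
    if output is not None:
        # Without -o the form is filled in place, as described above.
        cmd += ["-o", output]
    if not validate:
        cmd.append("--no-validate")
    if strict:
        cmd.append("--strict")
    if use_pdfcpu:
        cmd.append("--pdfcpu")
        if pdfcpu_path is not None:
            cmd += ["--pdfcpu-path", pdfcpu_path]
    return cmd


print(build_fill_command("form.pdf", "data.json", "filled.pdf", strict=True))
```

The resulting list can be handed to subprocess.run(cmd, check=True) once pdf-forms is installed.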

JSON Format

The fill-form command accepts a flat JSON object in which each key is a field name and each value is the value to write into that field:

{
  "Candidate Name": "John Smith",
  "Position": "Software Engineer",
  "Start date": "2025-06-01",
  "Full time": true,
  "Diploma or GED": "Yes"
}
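
A data file in this shape can be produced from a Python dict with the standard json module. The sketch below is illustrative: to_fill_json is a hypothetical helper, and restricting values to strings and booleans is an assumption based on the examples in this README, not a documented rule. Note that Python's True is serialized as JSON true.

```python
import json


def to_fill_json(values: dict) -> str:
    """Serialize a field-name -> value mapping to the key:value JSON shown above.

    Hypothetical helper; the str/bool restriction is an assumption.
    """
    for name, value in values.items():
        if not isinstance(name, str):
            raise TypeError(f"field name must be a string, got {name!r}")
        if not isinstance(value, (str, bool)):
            raise TypeError(f"unsupported value for {name!r}: {value!r}")
    return json.dumps(values, indent=2, ensure_ascii=False)


sample = {
    "Candidate Name": "John Smith",
    "Start date": "2025-06-01",
    "Full time": True,
}
print(to_fill_json(sample))
```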

Python API

from privacyforms_pdf import PDFFormExtractor

# Initialize the extractor
extractor = PDFFormExtractor()

# Extract form data
form_data = extractor.extract("form.pdf")

# Access form information
print(f"PDF Version: {form_data.pdf_version}")
print(f"Has Form: {form_data.has_form}")
print(f"Total Fields: {len(form_data.fields)}")

# Iterate over fields
for field in form_data.fields:
    print(f"{field.name}: {field.value}")

# Get specific field value
value = extractor.get_field_value("form.pdf", "Field Name")

# Check if PDF has a form
has_form = extractor.has_form("form.pdf")

# Export to JSON file
extractor.extract_to_json("form.pdf", "output.json")

# Fill a form using simple key:value format
# (named "values" so it does not shadow the extracted form_data above)
values = {
    "Candidate Name": "John Smith",
    "Position": "Software Engineer",
    "Full time": True,
    "Start date": "2025-06-01",
}
extractor.fill_form("form.pdf", values, "filled.pdf")

# Or fill from a JSON file
extractor.fill_form_from_json("form.pdf", "data.json", "filled.pdf")

# Validate data before filling (returns list of errors)
errors = extractor.validate_form_data("form.pdf", values)
if errors:
    print("Validation errors:", errors)

# Fill a form using pdfcpu (requires pdfcpu to be installed)
# This can be useful when pypdf has issues with certain PDFs
from privacyforms_pdf import is_pdfcpu_available
if is_pdfcpu_available():
    extractor.fill_form_with_pdfcpu("form.pdf", values, "filled.pdf")

API Reference

PDFFormExtractor

The main class for extracting and filling PDF form data.

Constructor

extractor = PDFFormExtractor(
    timeout_seconds=30.0,    # float, default 30.0
    extract_geometry=True,   # bool, default True
)
  • timeout_seconds: Timeout for operations (kept for API compatibility).
  • extract_geometry: Whether to extract field geometry information.

Methods

  • has_form(pdf_path: str | Path) -> bool: Check if a PDF contains a form.
  • extract(pdf_path: str | Path) -> PDFFormData: Extract form data from a PDF.
  • extract_to_json(pdf_path: str | Path, output_path: str | Path) -> None: Export form data to a JSON file.
  • list_fields(pdf_path: str | Path) -> list[PDFField]: List all form fields in a PDF.
  • get_field_value(pdf_path: str | Path, field_name: str) -> str | bool | None: Get the value of a specific form field.
  • get_field_by_id(pdf_path: str | Path, field_id: str) -> PDFField | None: Get a form field by its ID.
  • get_field_by_name(pdf_path: str | Path, field_name: str) -> PDFField | None: Get a form field by its name.
  • validate_form_data(pdf_path: str | Path, form_data: dict, *, strict: bool = False, allow_extra_fields: bool = False) -> list[str]: Validate form data (simple key:value format).
  • fill_form(pdf_path: str | Path, form_data: dict, output_path: str | Path | None = None, *, validate: bool = True) -> Path: Fill a PDF form with data using pypdf.
  • fill_form_from_json(pdf_path: str | Path, json_path: str | Path, output_path: str | Path | None = None, *, validate: bool = True) -> Path: Fill a PDF form with data from a JSON file using pypdf.
  • fill_form_with_pdfcpu(pdf_path: str | Path, form_data: dict, output_path: str | Path | None = None, *, validate: bool = True, pdfcpu_path: str = "pdfcpu") -> Path: Fill a PDF form with data using pdfcpu binary.

Data Classes

PDFFormData

Represents extracted PDF form data.

  • source: Path: Path to the source PDF file.
  • pdf_version: str: Version of the PDF.
  • has_form: bool: Whether the PDF contains a form.
  • fields: list[PDFField]: List of form fields.
  • raw_data: dict[str, Any]: The raw data from pypdf.

PDFField

Represents a single form field.

  • name: str: The name of the field.
  • id: str: The unique identifier of the field.
  • field_type: str: The type of the form field (e.g., 'textfield', 'checkbox').
  • value: str | bool: The current value of the field.
  • pages: list[int]: List of pages where this field appears.
  • locked: bool: Whether the field is locked.
  • geometry: FieldGeometry | None: Optional geometry information (position and size).
  • format: str | None: Date format for datefield types.
  • options: list[str]: Available options for radiobuttongroup, combobox, listbox types.

FieldGeometry

Represents the geometry (position and size) of a form field.

  • page: int: 1-based page number where field appears.
  • rect: tuple[float, float, float, float]: Bounding box as (x1, y1, x2, y2) in PDF points.
  • x: float: Left coordinate.
  • y: float: Bottom coordinate (PDF coordinate system).
  • width: float: Field width in points.
  • height: float: Field height in points.
  • normalized_y: float: Y position quantized to 15-point buckets for row grouping (legacy method).
  • row_y: float: Y position of the row cluster center computed using adaptive clustering (recommended for row grouping).
  • units: str: Unit of measurement (always "pt" for points).
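
The relationship between rect and the derived x, y, width and height attributes can be sketched as follows. This is an illustration consistent with the sample values in the JSON export example in this README, not the library's actual implementation; geometry_from_rect is a hypothetical name.

```python
def geometry_from_rect(rect):
    """Derive left/bottom position and size from an (x1, y1, x2, y2) rectangle.

    Hypothetical helper; min/abs make it tolerant of rectangles whose
    corners are given in the opposite order.
    """
    x1, y1, x2, y2 = rect
    return {
        "x": min(x1, x2),          # left edge
        "y": min(y1, y2),          # bottom edge (PDF origin is bottom-left)
        "width": abs(x2 - x1),
        "height": abs(y2 - y1),
    }


print(geometry_from_rect((53.0, 1077.0, 414.0, 1104.0)))
```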

JSON Export Format

When using pdf-forms extract or extract_to_json(), the output JSON has the following structure:

{
  "source": "path/to/form.pdf",
  "pdf_version": "1.7",
  "has_form": true,
  "fields": [
    {
      "name": "Field Name",
      "id": "1",
      "field_type": "textfield",
      "value": "Field Value",
      "pages": [1],
      "locked": false,
      "geometry": {
        "page": 1,
        "rect": [53.0, 1077.0, 414.0, 1104.0],
        "x": 53.0,
        "y": 1077.0,
        "width": 361.0,
        "height": 27.0,
        "normalized_y": 1075.0,
        "row_y": 1075.0,
        "units": "pt"
      },
      "format": null,
      "options": []
    }
  ]
}
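
Because the export lists each field's name and value, it can be reduced to the flat key:value shape that fill-form accepts. The sketch below is a hedged example: export_to_fill_data is a hypothetical helper, it assumes field names are unique within one export, and it assumes exported values can be passed back unchanged.

```python
import json

# Trimmed copy of the export structure shown above.
EXPORT_JSON = """
{
  "source": "path/to/form.pdf",
  "pdf_version": "1.7",
  "has_form": true,
  "fields": [
    {"name": "Field Name", "id": "1", "field_type": "textfield",
     "value": "Field Value", "pages": [1], "locked": false,
     "geometry": null, "format": null, "options": []}
  ]
}
"""


def export_to_fill_data(export: dict) -> dict:
    """Map an extract export to {field name: value} (hypothetical helper)."""
    return {field["name"]: field["value"] for field in export["fields"]}


fill_data = export_to_fill_data(json.loads(EXPORT_JSON))
print(json.dumps(fill_data, indent=2))
```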

Field Types:

  • textfield: Text input fields
  • datefield: Date input fields (may include format attribute)
  • checkbox: Boolean/checkbox fields (value is true or false)
  • radiobuttongroup: Radio button groups (may include options array)
  • combobox: Dropdown/combo boxes (may include options array)
  • listbox: List selection boxes (may include options array)
  • signature: Signature fields

Geometry: The geometry object contains the field's position and size in PDF points (1/72 inch):

  • rect: Array of [x1, y1, x2, y2] coordinates (bottom-left and top-right corners)
  • x, y: Bottom-left corner position
  • width, height: Field dimensions
  • normalized_y: Y position quantized to 15-point buckets (fields that fall into the same bucket share the same value)
  • row_y: Y position of the row cluster center, computed by adaptive clustering over the distribution of all field positions (recommended for row grouping)
  • Note: PDF coordinates have origin (0,0) at bottom-left of the page
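
As an illustration of row grouping with row_y, the sketch below groups exported fields by page and row_y, orders rows from the top of the page downward (larger y first, since the origin is at the bottom-left), and orders fields within a row from left to right. group_rows is a hypothetical helper operating on the "fields" list of the JSON export, and the sample data is made up.

```python
from collections import defaultdict


def group_rows(fields: list) -> list:
    """Return field names grouped into rows, top to bottom and left to right."""
    rows = defaultdict(list)
    for field in fields:
        geometry = field.get("geometry")
        if geometry is None:
            continue  # no geometry available for this field
        rows[(geometry["page"], geometry["row_y"])].append(field)

    ordered = []
    # Sort by page ascending, then by row_y descending (top of page first).
    for key in sorted(rows, key=lambda k: (k[0], -k[1])):
        row = sorted(rows[key], key=lambda f: f["geometry"]["x"])
        ordered.append([f["name"] for f in row])
    return ordered


sample_fields = [
    {"name": "Last name", "geometry": {"page": 1, "row_y": 700.0, "x": 300.0}},
    {"name": "First name", "geometry": {"page": 1, "row_y": 700.0, "x": 50.0}},
    {"name": "Email", "geometry": {"page": 1, "row_y": 650.0, "x": 50.0}},
    {"name": "Hidden", "geometry": None},
]
print(group_rows(sample_fields))
```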

Exceptions

  • PDFFormError: Base exception for PDF form related errors.
  • PDFFormNotFoundError: Raised when the PDF does not contain any forms.
  • FormValidationError: Raised when form data validation fails.
  • FieldNotFoundError: Raised when a field is not found in the form.

Note: For backwards compatibility, the following aliases are still available but deprecated:

  • PDFCPUError (alias for PDFFormError)
  • PDFCPUNotFoundError (alias for PDFFormError)
  • PDFCPUExecutionError (alias for PDFFormError)

Utility Functions

  • is_pdfcpu_available(pdfcpu_path: str = "pdfcpu") -> bool: Check if pdfcpu binary is available in the system PATH.
  • get_available_geometry_backends() -> list[str]: Return list of available geometry backends (always ["pypdf"]).
  • has_geometry_support() -> bool: Check if geometry extraction is supported (always True).

Development

Running Tests

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov

# Run linting
uv run ruff check .

# Run type checking
uv run ty check

Project Structure

privacyforms-pdf/
├── privacyforms_pdf/       # Main package
│   ├── __init__.py         # Package exports
│   ├── extractor.py        # PDFFormExtractor implementation
│   └── cli.py              # Command-line interface
├── tests/                  # Test suite
│   ├── test_extractor.py   # Tests for extractor
│   └── test_cli.py         # Tests for CLI
├── pyproject.toml          # Project configuration
└── README.md               # This file

License

Copyright 2026 Andreas Jung (info@zopyx.com)

About

PDF related functionality for Privacy Forms Studio
