A Python package for extracting and inferring metadata from PDF documents using AI-powered analysis.
- Extract metadata from PDF files
- AI-powered document analysis using LLMs
- CLI tool for easy batch processing
- Flexible configuration and output formatting
- Structured metadata models using Pydantic
- Python 3.12 or higher
- Ollama - Required for AI-powered analysis
- Install Ollama
- Pull a model:
ollama pull gemma3:4b
- See
pyproject.tomlfor full Python dependency list
pip install git+https://github.com/tidyeval/docinfer.gitClone the repository and install in editable mode:
git clone https://github.com/tidyeval/docinfer.git
cd docinfer
pip install -e .Run directly without installation using uvx:
uvx --from git+https://github.com/tidyeval/docinfer.git docinfer <path-to-pdf>Note: Once published to PyPI, you'll be able to run
uvx docinfer <path-to-pdf>directly.
If you've installed the package locally, run directly:
docinfer <path-to-pdf>--model MODEL- Specify the Ollama model (default:gemma3:4b)- Example:
docinfer document.pdf --model gemma2
- Example:
--json- Output as JSON instead of formatted text--no-ai- Skip AI analysis and show embedded metadata only--export FILE- Export results to JSON file--quiet- Suppress progress output
from docinfer.services.pdf_extractor import PDFExtractor
from docinfer.services.ai_analyzer import AIAnalyzer
# Extract PDF content
extractor = PDFExtractor()
content = extractor.extract("document.pdf")
# Analyze with AI
analyzer = AIAnalyzer()
metadata = analyzer.analyze(content)docinfer/
├── src/
│ ├── cli.py # Command-line interface
│ ├── models/ # Pydantic data models
│ ├── services/ # Core services (PDF extraction, AI analysis)
│ └── prompts/ # AI prompt templates
├── tests/ # Unit and integration tests
├── specs/ # Project specifications
├── pyproject.toml # Project configuration
└── README.md # This file
-
Clone the repository:
git clone https://github.com/tidyeval/docinfer.git cd docinfer -
Create and activate virtual environment:
python -m venv .venv source .venv/bin/activate # On Windows: .venv\Scripts\activate
-
Install in development mode:
pip install -e ".[dev]"
pytestThe project uses:
- black for code formatting
- ruff for linting
- pytest for testing
Contributions are welcome! Please ensure:
- Code passes linting and formatting checks
- Tests pass with good coverage
- Commit messages are descriptive
See LICENSE file for details.
Tino Kanngiesser (tinokanngiesser@gmail.com)