A community-supported supercharged version of paperless: scan, index and archive all your physical documents
Updated Nov 11, 2024 - Python
OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
A one-stop, open-source, high-quality data extraction tool that supports extraction from PDFs, web pages, and multi-format e-books.
Private chat with a local GPT over documents, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/
A pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
Get your documents ready for gen AI
Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
borb is a library for reading, creating, and manipulating PDF files in Python.
Parse files for optimal RAG
💀 Generate a bunch of malicious PDF files with phone-home functionality. Can be used with Burp Collaborator or Interact.sh
Open Source Document Management System for Digital Archives (Scanned Documents)