pdf

Here are 2,156 public repositories matching this topic...

paperless-ngx / paperless-ngx

A community-supported supercharged version of paperless: scan, index and archive all your physical documents

pdf machine-learning django angular ocr archiving dms document-management optical-character-recognition document-management-system

Updated Nov 11, 2024
Python

ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

python pdf ocr image-processing tesseract

Updated Nov 10, 2024
Python

opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.

python pdf parser ocr pdf-converter extract-data document-analysis pdf-parser layout-analysis ai4science pdf-extractor-rag pdf-extractor-llm pdf-extractor-pretrain

Updated Nov 11, 2024
Python

h2oai / h2ogpt

Private chat with a local GPT over documents, images, video, etc. 100% private, Apache 2.0. Supports Ollama, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/

pdf ai embeddings private gpt generative llm chatgpt gpt4all vectorstore privategpt llama2 mixtral

Updated Nov 7, 2024
Python

py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

python pdf help-wanted pdf-documents pypdf2 pdf-manipulation pdf-parsing pdf-parser

Updated Nov 8, 2024
Python

DS4SD / docling

Get your documents ready for gen AI

html markdown pdf ai convert pdf-converter docx documents pptx pdf-to-text tables document-parser pdf-to-json document-parsing

Updated Nov 10, 2024
Python

Kozea / WeasyPrint

The awesome document factory

css python html pdf converter weasyprint

Updated Oct 29, 2024
Python

jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

pdf pdf-parsing table-extraction

Updated Nov 11, 2024
Python

getomni-ai / zerox

PDF to Markdown with vision models

pdf ocr

Updated Nov 10, 2024
Python

pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF

python pdf parser

Updated Aug 2, 2024
Python

pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

python pdf font data-science ocr tesseract epub mupdf text-processing pdf-documents extract-data table-extraction text-shaping xps pymupdf

Updated Nov 5, 2024
Python

atlanhq / camelot

Camelot: PDF Table Extraction for Humans

pdf table extract for-humans

Updated Jan 5, 2023
Python

pdfarranger / pdfarranger

A small python-gtk application that helps the user merge or split PDF documents and rotate, crop, and rearrange their pages using an interactive and intuitive graphical interface.

linux pdf gtk python3 gtk3