pdf-to-text

Here are 90 public repositories matching this topic...

docling-project / docling

Get your documents ready for gen AI

html markdown pdf ai convert xlsx pdf-converter docx documents pptx pdf-to-text tables document-parser pdf-to-json document-parsing

Updated Jul 16, 2025
Python

Unstructured-IO / unstructured

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

Updated Jul 16, 2025
HTML

run-llama / llama_cloud_services

Knowledge Agents and Management in the Cloud

pdf parsing document pptx structured-data pdf-to-text pdf-to-excel tables docx-to-markdown document-parser pdf-document-processor pdf-to-json document-parsing ppt-to-json pdf-to-markdown ppt-to-markdown

Updated Jul 16, 2025
Python

enoch3712 / ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

python nlp pdf machine-learning ocr ai openai pdf-to-text document-processing document-image-analysis document-intelligence llm document-parsing langchain

Updated Jun 9, 2025
Python

Academic-Hammer / SciTSR

Table structure recognition dataset for the paper "Complicated Table Structure Recognition"

pdf-to-text pdf2txt table-structure-recognition

Updated Jul 7, 2020
Python

pd3f / pd3f

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

python pdf machine-learning ocr pipeline text-extraction pdf-to-text language-model extract-text parsr pd3f

Updated Oct 13, 2023
HTML

shoryasethia / markdrop

A Python package for converting PDFs to Markdown while extracting images and tables, and for generating text descriptions of the extracted tables and images using several LLM clients, among other features. Markdrop is available on PyPI.

open-source pdf-to-text image-to-text marker agents pypi-package table-to-text markitdown llm pdf-to-markdown docling markdrop

Updated Jul 5, 2025
Python

GiftMungmeeprued / document-parsers-list

A comprehensive list of document parsers, covering PDF-to-text conversion and layout extraction. Each is tested for support of tables, equations, handwriting, two-column layouts, and multi-column layouts.

pdf ocr preprocessing pdf-to-text document-image-processing data-pipeline document-parser document-parsing langchain

Updated Jul 14, 2025

NanoNets / ocr-python

OCR library to extract text & tables from PDF files and images. Convert any image or PDF to CSV / TXT / JSON / Searchable PDF.

python pdf ocr tesseract pdf-to-text image-to-text textract pdf-to-csv pdf-to-json searchable-pdf pytesseract-ocr extract-table table-extract image-to-text-converter extract-text-from-image extract-text-from-pdf

Updated Dec 2, 2022
Jupyter Notebook

nainiayoub / pdf-text-data-extractor

PDF text data extraction web app with OCR for scanned documents

python pdf ocr text-extraction pdf-to-text ocr-text-reader ocr-python streamlit streamlit-webapp

Updated Jun 5, 2024
Python

datalogics / adobe-pdf-library-samples

Sample code for the Datalogics C++, Java, and .NET interfaces of the Adobe PDF Library

pdf ocr pdf-converter pdf-document pdf-conversion pdf-generation pdf-to-text pdf-manipulation pdfa pdf-split pdf-merger pdf-parser pdf-to-image pdf-tools pdf-compression pdf-lib pdf-render ocr-pdf pdf-to-office

Updated May 22, 2023

BitMiracle / Docotic.Pdf.Samples

C# and VB.NET samples for Docotic.Pdf library

Updated Jul 1, 2025
Visual Basic .NET

galkahana / pdf-text-extraction

CLI for extracting text from PDF files (and possibly tables)

pdf pdf-to-text

Updated Jun 14, 2025
C++

papercast-dev / papercast

A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, Semantic Scholar, or PDF with GROBID and LangChain, and listen to them as a podcast. Customize your own pipelines.

python nlp pipeline podcast pdf-converter tts arxiv pdf-to-text dag document-parser pdf-document-processor grobid semantic-scholar document-parsing

Updated Mar 17, 2025
Python

mbzuai-oryx / KITAB-Bench

[ACL 2025 🔥] A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding

benchmark ocr vqa pdf-to-text arabic table-detection layout-detection vlms

Updated May 24, 2025
Python

iditectweb / converter

Standalone .NET converter library for converting PDF, DOCX, XLSX, HTML, image, CSV, RTF, and TXT files in the .NET Framework. It requires neither Adobe Acrobat components nor Microsoft Office Interop assemblies.